Preface


Since this is a textbook we biased our selection of references towards easily 
accessible work rather than the original references. While this may not be 
in the interest of the inventors of these concepts, it greatly simplifies access 
to those topics. Hence we encourage the reader to follow the references in 
the cited works should they be interested in finding out who may claim 
intellectual ownership of certain key ideas. 
Introduction


Over the past two decades Machine Learning has become one of the main- 
stays of information technology and with that, a rather central, albeit usually 
hidden, part of our life. With the ever increasing amounts of data becoming 
available there is good reason to believe that smart data analysis will become 
even more pervasive as a necessary ingredient for technological progress. 
The purpose of this chapter is to provide the reader with an overview of
the vast range of applications which have at their heart a machine learning 
problem and to bring some degree of order to the zoo of problems. After 
that, we will discuss some basic tools from statistics and probability theory, 
since they form the language in which many machine learning problems must 
be phrased to become amenable to solving. Finally, we will outline a set of 
fairly basic yet effective algorithms to solve an important problem, namely 
that of classification. More sophisticated tools, a discussion of more general 
problems and a detailed analysis will follow in later parts of the book. 


1.1 A Taste of Machine Learning 


Machine learning can appear in many guises. We now discuss a number of 
applications, the types of data they deal with, and finally, we formalize the 
problems in a somewhat more stylized fashion. The latter is key if we want to 
avoid reinventing the wheel for every new application. Instead, much of the 
art of machine learning is to reduce a range of fairly disparate problems to 
a set of fairly narrow prototypes. Much of the science of machine learning is 
then to solve those problems and provide good guarantees for the solutions. 


1.1.1 Applications 


Most readers will be familiar with the concept of web page ranking. That 
is, the process of submitting a query to a search engine, which then finds 
webpages relevant to the query and which returns them in their order of 
relevance. See e.g. Figure 1.1 for an example of the query results for “ma- 
chine learning”. That is, the search engine returns a sorted list of webpages 
given a query. To achieve this goal, a search engine needs to ‘know’ which 

Fig. 1.1. The 5 top scoring webpages for the query “machine learning” 


pages are relevant and which pages match the query. Such knowledge can be 
gained from several sources: the link structure of webpages, their content, 
the frequency with which users will follow the suggested links in a query, or 
from examples of queries in combination with manually ranked webpages. 
Increasingly, machine learning rather than guesswork and clever engineering
is used to automate the process of designing a good search engine [RPB06].

A closely related application is collaborative filtering. Internet bookstores
such as Amazon, or video rental sites such as Netflix, use it extensively
to entice users to purchase additional goods (or rent more
movies). The problem is quite similar to that of web page ranking. As
before, we want to obtain a sorted list (in this case of articles). The key
difference is that an explicit query is missing and instead we can only use past
purchase and viewing decisions of the user to predict future viewing and
purchase habits. The key side information here consists of the decisions made by
similar users, hence the collaborative nature of the process. See Figure 1.2
for an example. It is clearly desirable to have an automatic system to solve
this problem, thereby avoiding guesswork and saving time [BI07].

An equally ill-defined problem is that of automatic translation of doc- 
uments. At one extreme, we could aim at fully understanding a text before 
translating it using a curated set of rules crafted by a computational linguist 
well versed in the two languages we would like to translate between. This is a rather
arduous task, in particular given that text is not always grammatically cor- 
rect, nor is the document understanding part itself a trivial one. Instead, we 
could simply use examples of translated documents, such as the proceedings 
of the Canadian parliament or other multilingual entities (United Nations, 
European Union, Switzerland) to learn how to translate between the two 
languages. In other words, we could use examples of translations to learn
how to translate. This machine learning approach proved quite successful 
[?]. 

Many security applications, e.g. for access control, use face recognition as
one of their components. That is, given the photo (or video recording) of a
person, recognize who this person is. In other words, the system needs to
classify the faces into one of many categories (Alice, Bob, Charlie, ...) or
decide that it is an unknown face. A similar, yet conceptually quite different
problem is that of verification. Here the goal is to verify whether the person
in question is who he claims to be. Note that, unlike before, this
is now a yes/no question. To deal with different lighting conditions, facial
expressions, whether a person is wearing glasses, hairstyle, etc., it is desirable 
to have a system which learns which features are relevant for identifying a 
person. 

Another application where learning helps is the problem of named entity 
recognition (see Figure 1.4). That is, the problem of identifying entities, 
such as places, titles, names, actions, etc. from documents. Such steps are 
crucial in the automatic digestion and understanding of documents. Some 
modern e-mail clients, such as Apple’s Mail.app, now ship with the
ability to identify addresses in e-mails and file them automatically in an
address book. While systems using hand-crafted rules can lead to satisfactory
results, it is far more efficient to use examples of marked-up documents
to learn such dependencies automatically, in particular if we want to deploy
our system in many languages. For instance, while ‘bush’ and ‘rice’
Fig. 1.2. Books recommended by Amazon.com when viewing Tom Mitchell’s Machine
Learning Book [Mit97]. It is desirable for the vendor to recommend relevant
books which a user might purchase.


Fig. 1.3. 11 Pictures of the same person taken from the Yale face recognition 
database. The challenge is to recognize that we are dealing with the same per- 
son in all 11 cases. 
HAVANA (Reuters) - The European Union’s top development aid official
left Cuba on Sunday convinced that EU diplomatic sanctions against 
the communist island should be dropped after Fidel Castro’s 
retirement, his main aide said. 


<TYPE="ORGANIZATION" >HAVANA</> (<TYPE="ORGANIZATION">Reuters</>) - The 
<TYPE="ORGANIZATION">European Union</>’s top development aid official left 
<TYPE="ORGANIZATION">Cuba</> on Sunday convinced that EU diplomatic sanctions 
against the communist <TYPE="LOCATION">island</> should be dropped after 
<TYPE="PERSON">Fidel Castro</>’s retirement, his main aide said. 


Fig. 1.4. Named entity tagging of a news article (using LingPipe). The relevant 
locations, organizations and persons are tagged for further information extraction. 


are clearly terms from agriculture, it is equally clear that in the context of 
contemporary politics they refer to members of the Republican Party. 
Other applications which take advantage of learning are speech recog- 
nition (annotate an audio sequence with text, such as the system shipping 
with Microsoft Vista), the recognition of handwriting (annotate a sequence 
of strokes with text, a feature common to many PDAs), trackpads of com- 
puters (e.g. Synaptics, a major manufacturer of such pads derives its name 
from the synapses of a neural network), the detection of failure in jet en- 
gines, avatar behavior in computer games (e.g. Black and White), direct 
marketing (companies use past purchase behavior to guesstimate whether 
you might be willing to purchase even more) and floor cleaning robots (such 
as iRobot’s Roomba). The overarching theme of learning problems is that 
there exists a nontrivial dependence between some observations, which we 
will commonly refer to as x and a desired response, which we refer to as y, 
for which a simple set of deterministic rules is not known. By using learning 
we can infer such a dependency between x and y in a systematic fashion. 
We conclude this section by discussing the problem of classification, 
since it will serve as a prototypical problem for a significant part of this 
book. It occurs frequently in practice: for instance, when performing spam 
filtering, we are interested in a yes/no answer as to whether an e-mail con- 
tains relevant information or not. Note that this issue is quite user depen- 
dent: for a frequent traveller e-mails from an airline informing him about 
recent discounts might prove valuable information, whereas for many other 
recipients this might prove more of a nuisance (e.g. when the e-mail relates
to products available only overseas). Moreover, the nature of annoying e- 
mails might change over time, e.g. through the availability of new products 
(Viagra, Cialis, Levitra, ...), different opportunities for fraud (the Nigerian 
419 scam which took a new twist after the Iraq war), or different data types 
(e.g. spam which consists mainly of images). To combat these problems we 
Fig. 1.5. Binary classification; separate stars from diamonds. In this example we
are able to do so by drawing a straight line which separates both sets. We will see 
later that this is an important example of what is called a linear classifier. 


want to build a system which is able to learn how to classify new e-mails. 
A seemingly unrelated problem, that of cancer diagnosis, shares a common
structure: given histological data (e.g. from a microarray analysis of a pa- 
tient’s tissue) infer whether a patient is healthy or not. Again, we are asked 
to generate a yes/no answer given a set of observations. See Figure 1.5 for 
an example. 


1.1.2 Data 


It is useful to characterize learning problems according to the type of data 
they use. This is a great help when encountering new challenges, since quite 
often problems on similar data types can be solved with very similar tech- 
niques. For instance natural language processing and bioinformatics use very 
similar tools for strings of natural language text and for DNA sequences. 
Vectors constitute the most basic entity we might encounter in our work. 
For instance, a life insurance company might be interested in obtaining the
vector of variables (blood pressure, heart rate, height, weight, cholesterol 
level, smoker, gender) to infer the life expectancy of a potential customer. 
A farmer might be interested in determining the ripeness of fruit based on 
(size, weight, spectral data). An engineer might want to find dependencies 
in (voltage, current) pairs. Likewise one might want to represent documents 
by a vector of counts which describe the occurrence of words. The latter is 
commonly referred to as bag of words features. 
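
As a minimal sketch of the bag of words representation, the following Python snippet counts how often each word of a fixed vocabulary occurs in a document. The vocabulary, the example sentence, the whitespace tokenization and the function name `bag_of_words` are our own illustrative assumptions, not a prescribed preprocessing pipeline.

```python
from collections import Counter


def bag_of_words(document, vocabulary):
    """Return the vector of word counts of `document` over a fixed vocabulary."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    counts = Counter(document.lower().split())
    # Words that do not occur in the document get a count of 0.
    return [counts[word] for word in vocabulary]


vocabulary = ["learning", "machine", "spam", "the"]
vector = bag_of_words("The machine is learning and the learning is fast", vocabulary)
print(vector)
```

Note that the word order is discarded: any permutation of the words in the document yields the same vector.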

One of the challenges in dealing with vectors is that the scales and units
of different coordinates may vary widely. For instance, we could measure
weight in kilograms, pounds, grams, tons, or stones, all of which would amount
to multiplicative changes. Likewise, when representing temperatures, we
have a full class of affine transformations, depending on whether we
represent them in terms of Celsius, Kelvin or Fahrenheit. One way of dealing
with those issues in an automatic fashion is to normalize the data. We will
discuss means of doing so.
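
One possible normalization, shown here only as a sketch, is to rescale each coordinate to zero mean and unit variance. The text does not commit to a particular scheme, so this choice, the function name `standardize` and the toy measurements are assumptions. The point of the example is that the result no longer depends on the unit of measurement.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a list of numbers to zero mean and unit variance."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation; must be nonzero
    return [(value - mu) / sigma for value in column]


# Multiplicative change of units: kilograms versus pounds.
weights_kg = [60.0, 70.0, 80.0]
weights_lb = [w * 2.20462 for w in weights_kg]
z_kg = standardize(weights_kg)
z_lb = standardize(weights_lb)

# Affine change of units: Celsius versus Fahrenheit.
temps_c = [10.0, 20.0, 30.0]
temps_f = [1.8 * t + 32.0 for t in temps_c]
z_c = standardize(temps_c)
z_f = standardize(temps_f)
```

Up to rounding errors, `z_kg` equals `z_lb` and `z_c` equals `z_f`, since subtracting the mean removes the offset and dividing by the standard deviation removes the positive scale factor.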

Lists: In some cases the vectors we obtain may contain a variable number 
of features. For instance, a physician might not necessarily decide to perform 
a full battery of diagnostic tests if the patient appears to be healthy. 

Sets may appear in learning problems whenever there is a large number of 
potential causes of an effect, which are not well determined. For instance, it is 
relatively easy to obtain data concerning the toxicity of mushrooms. It would 
be desirable to use such data to infer the toxicity of a new mushroom given 
information about its chemical compounds. However, mushrooms contain a 
cocktail of compounds out of which one or more may be toxic. Consequently 
we need to infer the properties of an object given a set of features, whose 
composition and number may vary considerably. 

Matrices are a convenient means of representing pairwise relationships. 
For instance, in collaborative filtering applications the rows of the matrix 
may represent users whereas the columns correspond to products. Only in
some cases will we have knowledge about a given (user, product) combination,
such as the rating of the product by a user.
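
Since most entries of such a matrix are unknown, one simple way to store it (a sketch with made-up users, products and ratings) is to keep only the observed (user, product) pairs and to treat every other entry as missing rather than as zero.

```python
# Hypothetical data: only observed (user, product) pairs are stored.
ratings = {
    ("alice", "book_a"): 5,
    ("alice", "book_b"): 3,
    ("bob", "book_a"): 4,
}


def rating(user, product):
    """Return the observed rating, or None if the pair was never observed."""
    return ratings.get((user, product))


# Dense view of the same matrix: rows are users, columns are products.
users = sorted({user for user, _ in ratings})
products = sorted({product for _, product in ratings})
dense = [[rating(u, p) for p in products] for u in users]
```

Here `dense` contains `None` for the pair (bob, book_b). Predicting such missing entries from the observed ones is the collaborative filtering problem.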

A related situation occurs whenever we only have similarity information 
between observations, as implemented by a semi-empirical distance mea- 
sure. Some homology searches in bioinformatics, e.g. variants of BLAST 
[AGML90], only return a similarity score which does not necessarily satisfy 
the requirements of a metric. 

Images could be thought of as two dimensional arrays of numbers, that is, 
matrices. This representation is very crude, though, since they exhibit spa- 
tial coherence (lines, shapes) and (natural images exhibit) a multiresolution 
structure. That is, downsampling an image leads to an object which has very 
similar statistics to the original image. Computer vision and psychooptics 
have created a raft of tools for describing these phenomena. 

Video adds a temporal dimension to images. Again, we could represent 
them as a three dimensional array. Good algorithms, however, take the tem- 
poral coherence of the image sequence into account. 

Trees and Graphs are often used to describe relations between collec- 
tions of objects. For instance the ontology of webpages of the DMOZ project 
(www.dmoz.org) has the form of a tree with topics becoming increasingly 
refined as we traverse from the root to one of the leaves (Arts → Animation
→ Anime → General Fan Pages → Official Sites). In the case of gene ontology
the relationships form a directed acyclic graph, also referred to as the
GO-DAG [ABB+00].

Both examples above describe estimation problems where our observations 
are vertices of a tree or graph. However, graphs themselves may be the
observations. For instance, the DOM-tree of a webpage, the call-graph of 
a computer program, or the protein-protein interaction networks may form 
the basis upon which we may want to perform inference. 

Strings occur frequently, mainly in the area of bioinformatics and natural 
language processing. They may be the input to our estimation problems, e.g. 
when classifying an e-mail as spam, when attempting to locate all names of 
persons and organizations in a text, or when modeling the topic structure 
of a document. Equally well they may constitute the output of a system. 
For instance, we may want to perform document summarization, automatic 
translation, or attempt to answer natural language queries. 

Compound structures are the most commonly occurring object. That 
is, in most situations we will have a structured mix of different data types. 
For instance, a webpage might contain images, text, tables, which in turn 
contain numbers, and lists, all of which might constitute nodes on a graph of 
webpages linked among each other. Good statistical modelling takes such de- 
pendencies and structures into account in order to tailor sufficiently flexible 
models. 


1.1.3 Problems 


The range of learning problems is clearly large, as we saw when discussing 
applications. That said, researchers have identified an ever growing number 
of templates which can be used to address a large set of situations. It is those 
templates which make deployment of machine learning in practice easy and 
our discussion will largely focus on a choice set of such problems. We now 
give a by no means complete list of templates. 

Binary Classification is probably the most frequently studied problem 
in machine learning and it has led to a large number of important algorithmic 
and theoretic developments over the past century. In its simplest form it 
reduces to the question: given a pattern x drawn from a domain X, estimate
which value an associated binary random variable $y \in \{\pm 1\}$ will assume.
For instance, given pictures of apples and oranges, we might want to state 
whether the object in question is an apple or an orange. Equally well, we 
might want to predict whether a home owner might default on his loan, 
given income data, his credit history, or whether a given e-mail is spam or 
ham. The ability to solve this basic problem already allows us to address a 
large variety of practical settings. 

Many variants exist with regard to the protocol in which we are
required to make our estimation:
Fig. 1.6. Left: binary classification. Right: 3-class classification. Note that in the
latter case we have much more room for ambiguity. For instance, being able to
distinguish stars from diamonds may not suffice to identify either of them correctly,
since we also need to distinguish both of them from triangles.


We might see a sequence of $(x_i, y_i)$ pairs for which $y_i$ needs to be estimated
in an instantaneous online fashion. This is commonly referred to as online
learning.

We might observe a collection $X := \{x_1, \ldots, x_m\}$ and $Y := \{y_1, \ldots, y_m\}$ of
pairs $(x_i, y_i)$ which are then used to estimate y for a (set of) so-far unseen
$X' = \{x'_1, \ldots, x'_{m'}\}$. This is commonly referred to as batch learning.

We might be allowed to know $X'$ already at the time of constructing the
model. This is commonly referred to as transduction.

We might be allowed to choose X for the purpose of model building. This
is known as active learning.

We might not have full information about X, e.g. some of the coordinates
of the $x_i$ might be missing, leading to the problem of estimation with
missing variables.

The sets X and X’ might come from different data sources, leading to the 
problem of covariate shift correction. 

We might be given observations stemming from two problems at the same 
time with the side information that both problems are somehow related. 
This is known as co-training. 

Mistakes of estimation might be penalized differently depending on the 
type of error, e.g. when trying to distinguish diamonds from rocks a very 
asymmetric loss applies. 
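
To make the difference between the online and the batch protocol concrete, the following sketch uses a deliberately trivial estimator that ignores x and predicts the label seen most often so far. The estimator, the function names and the toy data are our assumptions; only the order in which labels are revealed matters here.

```python
def majority_label(labels):
    """Predict +1 unless strictly more -1 than +1 labels were seen."""
    return +1 if sum(labels) >= 0 else -1


def online_mistakes(stream):
    """Online protocol: commit to a prediction before each label is revealed."""
    seen, mistakes = [], 0
    for x, y in stream:
        prediction = majority_label(seen)  # uses only the labels revealed so far
        if prediction != y:
            mistakes += 1
        seen.append(y)  # the true label becomes available afterwards
    return mistakes


def batch_predict(train, test_inputs):
    """Batch protocol: fit once on all labeled pairs, then predict on unseen inputs."""
    label = majority_label([y for _, y in train])
    return [label for _ in test_inputs]


data = [(0.1, +1), (0.4, -1), (0.5, -1), (0.9, -1)]
mistakes = online_mistakes(data)
predictions = batch_predict(data, [0.2, 0.7])
```

In the online loop the estimator is evaluated on every example before it may learn from it, whereas in the batch setting training and prediction are separate phases.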


Multiclass Classification is the logical extension of binary classification.
The main difference is that now $y \in \{1, \ldots, n\}$ may assume a range of
different values. For instance, we might want to classify a document according
to the language it was written in (English, French, German, Spanish,
Hindi, Japanese, Chinese, ...). See Figure 1.6 for an example. The main
difference to before is that the cost of error may heavily depend on the type of
Fig. 1.7. Regression estimation. We are given a number of instances (indicated by
black dots) and would like to find some function f mapping the observations X to
ℝ such that f(x) is close to the observed values.


error we make. For instance, in the problem of assessing the risk of cancer, it
makes a significant difference whether we misclassify an early stage of cancer
as healthy (in which case the patient is likely to die) or as an advanced
stage of cancer (in which case the patient is likely to be inconvenienced by
overly aggressive treatment).
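
A class-dependent cost can be written as a table indexed by the true and the predicted class. The sketch below, in which all costs and probabilities are invented purely for illustration, shows that the prediction with the smallest expected cost need not be the most probable class.

```python
classes = ["healthy", "early", "advanced"]

# cost[true][predicted]; the numbers are illustrative assumptions only.
cost = {
    "healthy":  {"healthy": 0,   "early": 1, "advanced": 2},
    "early":    {"healthy": 100, "early": 0, "advanced": 5},
    "advanced": {"healthy": 100, "early": 5, "advanced": 0},
}


def expected_cost(posterior, predicted):
    """Average cost of predicting `predicted` when the truth follows `posterior`."""
    return sum(posterior[true] * cost[true][predicted] for true in classes)


def min_cost_prediction(posterior):
    """Class with the smallest expected cost."""
    return min(classes, key=lambda c: expected_cost(posterior, c))


# Hypothetical class probabilities for one patient.
posterior = {"healthy": 0.90, "early": 0.08, "advanced": 0.02}
most_probable = max(posterior, key=posterior.get)
cheapest = min_cost_prediction(posterior)
```

With these numbers the most probable class is "healthy", yet the cost-minimizing prediction is "early", because overlooking a cancer is assumed to be far more expensive than a false alarm.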

Structured Estimation goes beyond simple multiclass estimation by 
assuming that the labels y have some additional structure which can be used 
in the estimation process. For instance, y might be a path in an ontology, 
when attempting to classify webpages, y might be a permutation, when 
attempting to match objects, to perform collaborative filtering, or to rank 
documents in a retrieval setting. Equally well, y might be an annotation of 
a text, when performing named entity recognition. Each of those problems 
has its own properties in terms of the set of y which we might consider 
admissible, or how to search this space. We will discuss a number of those 
problems in Chapter ??. 

Regression is another prototypical application. Here the goal is to estimate
a real-valued variable $y \in \mathbb{R}$ given a pattern x (see e.g. Figure 1.7). For
instance, we might want to estimate the value of a stock the next day, the 
yield of a semiconductor fab given the current process, the iron content of 
ore given mass spectroscopy measurements, or the heart rate of an athlete, 
given accelerometer data. One of the key issues in which regression problems 
differ from each other is the choice of a loss. For instance, when estimating 
stock values our loss for a put option will be decidedly one-sided. On the 
other hand, a hobby athlete might only care that our estimate of the heart 
rate matches the actual on average. 

Novelty Detection is a rather ill-defined problem. It describes the issue 
of determining “unusual” observations given a set of past measurements. 
Clearly, the choice of what is to be considered unusual is very subjective. 
A commonly accepted notion is that unusual events occur rarely. Hence a 
possible goal is to design a system which assigns to each observation a rating 
Fig. 1.8. Left: typical digits contained in the database of the US Postal Service.
Right: unusual digits found by a novelty detection algorithm [SPST+01] (for a
description of the algorithm see Section 7.4). The score below the digits indicates
the degree of novelty. The numbers on the lower right indicate the class associated
with the digit.


as to how novel it is. Readers familiar with density estimation might contend 
that the latter would be a reasonable solution. However, we neither need a 
score which sums up to 1 on the entire domain, nor do we care particularly 
much about novelty scores for typical observations. We will later see how this 
somewhat easier goal can be achieved directly. Figure 1.8 has an example of 
novelty detection when applied to an optical character recognition database. 


1.2 Probability Theory 


In order to deal with the instances where machine learning can be used, we
need to develop an adequate language which is able to describe the problems
concisely. Below we begin with a fairly informal overview of probability
theory. For more details and a very gentle and detailed discussion see the
excellent book of [BT03].


1.2.1 Random Variables 


Assume that we cast a die and would like to know our chances of seeing
a 1 rather than another digit. If the die is fair all six outcomes
X = {1,...,6} are equally likely to occur, hence we would see a 1 in roughly
1 out of 6 cases. Probability theory allows us to model uncertainty in the
outcome of such experiments. Formally we state that 1 occurs with probability
1/6. In many experiments, such as the roll of a die, the outcomes are of a
numerical nature and we can handle them easily. In other cases, the outcomes
may not be numerical, e.g., if we toss a coin and observe heads or tails. In
these cases, it is useful to associate numerical values to the outcomes. This
is done via a random variable. For instance, we can let a random variable
X take on a value +1 whenever the coin lands heads and a value of −1
otherwise. Our notational convention will be to use uppercase letters, e.g.,
X, Y etc. to denote random variables and lowercase letters, e.g., x, y etc. to
denote the values they take.


Fig. 1.9. The random variable ξ maps from the set of outcomes of an experiment
(denoted here by X) to real numbers. As an illustration here X consists of the
patients a physician might encounter, and they are mapped via ξ to their weight
and height.


1.2.2 Distributions 


Perhaps the most important way to characterize a random variable is to 
associate probabilities with the values it can take. If the random variable is 
discrete, i.e., it takes on a finite number of values, then this assignment of 
probabilities is called a probability mass function or PMF for short. A PMF 
must be, by definition, non-negative and must sum to one. For instance, 
if the coin is fair, i.e., heads and tails are equally likely, then the random 
variable X described above takes on values of +1 and −1 with probability
0.5. This can be written as

$$\Pr(X = +1) = 0.5 \quad \text{and} \quad \Pr(X = -1) = 0.5. \tag{1.1}$$

When there is no danger of confusion we will use the slightly informal
notation p(x) := Pr(X = x).
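
The coin example can be spelled out in a few lines of Python. The sketch below, whose function and variable names are our own, represents the random variable as a map from outcomes to numbers, derives its PMF from the probabilities of the outcomes, and checks the two defining properties of a PMF.

```python
from fractions import Fraction

# Probabilities of the outcomes of a fair coin toss.
outcome_probability = {"heads": Fraction(1, 2), "tails": Fraction(1, 2)}


def X(outcome):
    """Random variable: +1 for heads, -1 otherwise."""
    return +1 if outcome == "heads" else -1


def pmf(probabilities, random_variable):
    """Collect the probability of each value the random variable can take."""
    p = {}
    for outcome, prob in probabilities.items():
        value = random_variable(outcome)
        p[value] = p.get(value, 0) + prob
    return p


def is_valid_pmf(p):
    """A PMF must be non-negative and must sum to one."""
    return all(v >= 0 for v in p.values()) and sum(p.values()) == 1


p = pmf(outcome_probability, X)
```

Exact fractions are used so that the check for summing to one is not affected by floating point rounding.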

In case of a continuous random variable the assignment of probabilities 
results in a probability density function or PDF for short. With some abuse 
of terminology, but keeping in line with convention, we will often use density 
or distribution instead of probability density function. As in the case of the 
PMF, a PDF must also be non-negative and integrate to one. Figure 1.10 
shows two distributions: the uniform distribution 


$$p(x) = \begin{cases} \frac{1}{b - a} & \text{if } x \in [a, b] \\ 0 & \text{otherwise,} \end{cases} \tag{1.2}$$
Fig. 1.10. Two common densities. Left: uniform distribution over the interval
[−1, 1]. Right: Normal distribution with zero mean and unit variance.


and the Gaussian distribution (also called normal distribution) 


$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right). \tag{1.3}$$
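
As a quick numerical sanity check, the sketch below implements the standard textbook forms of the uniform density on [a, b] and of the Gaussian density with mean mu and standard deviation sigma, and integrates both with a simple midpoint rule. The function names, the integration limits and the number of grid cells are our assumptions.

```python
import math


def uniform_pdf(x, a=-1.0, b=1.0):
    """Density of the uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and variance sigma**2."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, lo, hi, n=100_000):
    """Midpoint rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


uniform_mass = integrate(uniform_pdf, -2.0, 2.0)
gaussian_mass = integrate(gaussian_pdf, -10.0, 10.0)
```

Both values come out very close to 1, as required of a density. The Gaussian integral is truncated to [-10, 10], which discards only a negligible amount of probability mass.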


Closely associated with a PDF is the indefinite integral over p. It is com- 
monly referred to as the cumulative distribution function (CDF). 


Definition 1.1 (Cumulative Distribution Function) For a real valued
random variable X with PDF p the associated Cumulative Distribution
Function F is given by

$$F(x') := \Pr\{X \leq x'\} = \int_{-\infty}^{x'} dp(x). \tag{1.4}$$


The CDF F(x’) allows us to perform range queries on p efficiently. For
instance, by integral calculus we obtain

$$\Pr(a \leq X \leq b) = \int_a^b dp(x) = F(b) - F(a). \tag{1.5}$$


The values of x’ for which F(x’) assumes a specific value, such as 0.1 or 0.5 
have a special name. They are called the quantiles of the distribution p. 


Definition 1.2 (Quantiles) Let $q \in (0, 1)$. Then the value of x’ for which
$\Pr(X < x') \leq q$ and $\Pr(X > x') \leq 1 - q$ is the q-quantile of the distribution
p. Moreover, the value x’ associated with q = 0.5 is called the median.
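
For a continuous, strictly increasing CDF the q-quantile is simply the point at which F crosses q, which can be located by bisection. The following sketch does so for the normal distribution, whose CDF can be expressed via the error function. The helper names, the search interval and the tolerance are assumptions.

```python
import math


def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def quantile(cdf, q, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection for the q-quantile; assumes cdf is continuous and
    nondecreasing with cdf(lo) < q < cdf(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


median = quantile(gaussian_cdf, 0.5)
q10 = quantile(gaussian_cdf, 0.1)
q90 = quantile(gaussian_cdf, 0.9)

# Range query: probability that X falls into [-1, 1], computed as F(1) - F(-1).
prob_within_one_sigma = gaussian_cdf(1.0) - gaussian_cdf(-1.0)
```

For the standard normal distribution the median is 0, the 0.1 and 0.9 quantiles are symmetric around it at roughly -1.28 and 1.28, and about 68% of the probability mass lies within one standard deviation of the mean.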
Fig. 1.11. Quantiles of a distribution correspond to the area under the integral of
the density p(x) for which the integral takes on a pre-specified value. Illustrated
are the 0.1, 0.5 and 0.9 quantiles respectively.


1.2.3 Mean and Variance

A common question to ask about a random variable is what its expected 
value might be. For instance, when measuring the voltage of a device, we 
might ask what its typical values might be. When deciding whether to ad- 
minister a growth hormone to a child a doctor might ask what a sensible 
range of height should be. For those purposes we need to define expectations 
and related quantities of distributions. 


Definition 1.3 (Mean) We define the mean of a random variable X as

$$\mathbf{E}[X] := \int x \, dp(x). \tag{1.6}$$

More generally, if $f : \mathbb{R} \to \mathbb{R}$ is a function, then f(X) is also a random
variable. Its mean is given by

$$\mathbf{E}[f(X)] := \int f(x) \, dp(x). \tag{1.7}$$

Whenever X is a discrete random variable the integral in (1.6) can be
replaced by a summation:

$$\mathbf{E}[X] = \sum_x x \, p(x). \tag{1.8}$$

For instance, in the case of a die we have equal probabilities of 1/6 for all
6 possible outcomes. It is easy to see that this translates into a mean of
(1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
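
The summation formula for the mean of a discrete random variable translates directly into code. The sketch below reproduces the value for a fair die and also evaluates the mean of f(X) for f(x) = x², a choice of f made purely for illustration.

```python
from fractions import Fraction

# PMF of a fair die: each of the six faces has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def expectation(pmf, f=lambda x: x):
    """Mean of f(X) for a discrete random variable X with the given PMF."""
    return sum(f(x) * p for x, p in pmf.items())


mean = expectation(die_pmf)                        # equals 7/2
mean_square = expectation(die_pmf, lambda x: x * x)  # equals 91/6
```

Note that the mean of X² is 91/6, which differs from the square of the mean, 49/4. In general the mean of f(X) cannot be obtained by applying f to the mean of X.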

The mean of a random variable is useful in assessing expected losses and 
benefits. For instance, as a stock broker we might be interested in the ex- 
pected value of our investment in a year’s time. In addition to that, however, 
we also might want to investigate the risk of our investment. That is, how 
likely it is that the value of the investment might deviate from its expectation
since this might be more relevant for our decisions. This means that we
need a variable to quantify the risk inherent in a random variable. One such
measure is the variance of a random variable.


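Definition 1.4 (Variance) We define the variance of a random variable
X as

$$\mathrm{Var}[X] := \mathbf{E}\left[(X - \mathbf{E}[X])^2\right]. \tag{1.9}$$

As before, if $f : \mathbb{R} \to \mathbb{R}$ is a function, then the variance of f(X) is given by

$$\mathrm{Var}[f(X)] := \mathbf{E}\left[(f(X) - \mathbf{E}[f(X)])^2\right]. \tag{1.10}$$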


The variance measures by how much on average f(X) deviates from its ex- 
pected value. As we shall see in Section 2.1, an upper bound on the variance 
can be used to give guarantees on the probability that f(X) will be within 
ε of its expected value. This is one of the reasons why the variance is often
associated with the risk of a random variable. Note that often one discusses 
properties of a random variable in terms of its standard deviation, which is 
defined as the square root of the variance. 
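
Continuing the example of a fair die, the following sketch evaluates the variance as the mean squared deviation from the mean and takes its square root to obtain the standard deviation. The helper names are our own.

```python
import math
from fractions import Fraction

# PMF of a fair die: each of the six faces has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def expectation(pmf, f=lambda x: x):
    """Mean of f(X) for a discrete random variable X with the given PMF."""
    return sum(f(x) * p for x, p in pmf.items())


def variance(pmf, f=lambda x: x):
    """Mean squared deviation of f(X) from its mean."""
    m = expectation(pmf, f)
    return expectation(pmf, lambda x: (f(x) - m) ** 2)


var = variance(die_pmf)   # equals 35/12
std = math.sqrt(var)      # standard deviation, roughly 1.71
```

The standard deviation has the same unit as the random variable itself, which is why it is often easier to interpret than the variance.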


1.2.4 Marginalization, Independence, Conditioning, and Bayes 
Rule 


Given two random variables X and Y, one can write their joint density
p(x, y). Given the joint density, one can recover p(x) by integrating out y.
This operation is called marginalization:

$$p(x) = \int_y dp(x, y). \tag{1.11}$$

If Y is a discrete random variable, then we can replace the integration with
a summation:

$$p(x) = \sum_y p(x, y). \tag{1.12}$$

We say that X and Y are independent, i.e., the values that X takes do
not depend on the values that Y takes, whenever


p(x, y) = p(x)p(y). (1.13) 
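
For discrete random variables, marginalization (1.12) and the independence condition (1.13) can be checked directly on a table of joint probabilities. The joint distribution below is invented for illustration, and the function names are assumptions.

```python
from fractions import Fraction

# Hypothetical joint PMF p(x, y) over x in {0, 1} and y in {0, 1}.
joint = {
    (0, 0): Fraction(3, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10),
    (1, 1): Fraction(4, 10),
}


def marginal(joint, axis):
    """Sum out the other variable; axis=0 yields p(x), axis=1 yields p(y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0) + p
    return out


def independent(joint):
    """Check whether p(x, y) = p(x) p(y) holds for every pair (x, y)."""
    p_x, p_y = marginal(joint, 0), marginal(joint, 1)
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x in p_x for y in p_y)


p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

# The product of the marginals is independent by construction.
product = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}
```

Here `independent(joint)` is false while `independent(product)` is true, although both tables have exactly the same marginals. The marginals alone therefore do not determine the joint distribution.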


Independence is useful when it comes to dealing with large numbers of random
variables whose behavior we want to estimate jointly. For instance,
whenever we perform repeated measurements of a quantity, such as when
measuring the voltage of a device, we will typically assume that the individual
measurements are drawn from the same distribution and that they are
independent of each other. That is, having measured the voltage a number
of times will not affect the value of the next measurement. We call such
random variables independently and identically distributed, or in short,
iid random variables. See Figure 1.12 for an example of a pair of random
variables drawn from dependent and independent distributions respectively.

Fig. 1.12. Left: a sample from two dependent random variables. Knowing the
first coordinate allows us to improve our guess about the second coordinate. Right:
a sample drawn from two independent random variables, obtained by randomly
permuting the dependent sample.
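Marginalization (1.12) and the independence condition (1.13) are easy to check on a small discrete table. The sketch below does so in plain Python; the joint distribution `joint` is a made-up example, and the helper names are illustrative only.

```python
from itertools import product

# Hypothetical joint distribution p(x, y) over weather X and sky Y.
joint = {
    ("rain", "cloudy"): 0.30, ("rain", "sunny"): 0.05,
    ("dry", "cloudy"): 0.20, ("dry", "sunny"): 0.45,
}

def marginal(joint, axis):
    """Sum out the other variable: axis=0 gives p(x), axis=1 gives p(y)."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def is_independent(joint, tol=1e-12):
    """Check p(x, y) = p(x) p(y) for every pair of values, as in (1.13)."""
    px, py = marginal(joint, 0), marginal(joint, 1)
    return all(abs(joint.get((x, y), 0.0) - px[x] * py[y]) <= tol
               for x, y in product(px, py))

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)

# The product of the marginals is independent by construction.
indep_joint = {(x, y): p_x[x] * p_y[y] for x, y in product(p_x, p_y)}
print(p_x, p_y, is_independent(joint), is_independent(indep_joint))
```

The table `indep_joint` has the same marginals as `joint` but no dependence between the two variables, which is the discrete analogue of the permuted sample in Figure 1.12.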

Conversely, dependence can be vital in classification and regression problems.
For instance, the traffic lights at an intersection are dependent on each
other. This allows a driver to infer that when the lights are
green in his direction there will be no traffic crossing his path, i.e. the other
lights will indeed be red. Likewise, whenever we are given a picture x of a
digit, we hope that there will be dependence between x and its label y.

Especially in the case of dependent random variables, we are interested
in conditional probabilities, i.e., the probability that X takes on a particular
value given the value of Y. Clearly Pr(X = rain|Y = cloudy) is higher than
Pr(X = rain|Y = sunny). In other words, knowledge about the value of Y
significantly influences the distribution of X. This is captured via conditional
probabilities:

$$p(x|y) := \frac{p(x, y)}{p(y)}. \qquad (1.14)$$


Equation 1.14 leads to one of the key tools in statistical inference. 


Theorem 1.5 (Bayes Rule) Denote by X and Y random variables. Then
the following holds:

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}. \qquad (1.15)$$

This follows from the fact that p(x, y) = p(x|y)p(y) = p(y|x)p(x). The key
consequence of (1.15) is that we may reverse the conditioning between a
pair of random variables.


1.2.4.1 An Example 


We illustrate our reasoning by means of a simple example — inference using 
an AIDS test. Assume that a patient would like to have such a test carried 
out on him. The physician recommends a test which is guaranteed to detect 
HIV-positive whenever a patient is infected. On the other hand, for healthy 
patients it has a 1% error rate. That is, with probability 0.01 it diagnoses 
a patient as HIV-positive even when he is, in fact, HIV-negative. Moreover, 
assume that 0.15% of the population is infected. 

Now assume that the patient has the test carried out and the test returns
HIV-negative. In this case, logic implies that he is healthy, since the
test has a 100% detection rate. In the converse case things are not quite as
straightforward. Denote by X and T the random variables associated with
the health status of the patient and the outcome of the test respectively. We
are interested in p(X = HIV+|T = HIV+). By Bayes rule we may write


$$p(X = \text{HIV+} \mid T = \text{HIV+}) = \frac{p(T = \text{HIV+} \mid X = \text{HIV+})\,p(X = \text{HIV+})}{p(T = \text{HIV+})}.$$


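While we know all terms in the numerator, p(T = HIV+) itself is unknown.
That said, it can be computed via

$$\begin{aligned}
p(T = \text{HIV+}) &= \sum_{x \in \{\text{HIV+}, \text{HIV-}\}} p(T = \text{HIV+}, x) \\
&= \sum_{x \in \{\text{HIV+}, \text{HIV-}\}} p(T = \text{HIV+} \mid x)\,p(x) \\
&= 1.0 \cdot 0.0015 + 0.01 \cdot 0.9985.
\end{aligned}$$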


Substituting back into the conditional expression yields

$$p(X = \text{HIV+} \mid T = \text{HIV+}) = \frac{1.0 \cdot 0.0015}{1.0 \cdot 0.0015 + 0.01 \cdot 0.9985} = 0.1306.$$

In other words, even though our test is quite reliable, there is such a low
prior probability of having been infected with HIV that there is not much
evidence to accept the hypothesis even after this test.
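The calculation above is short enough to reproduce directly. The following sketch applies Bayes rule (1.15) to a discrete hypothesis, using the prevalence and error rates stated in the example; the function name `posterior` is illustrative.

```python
def posterior(prior, likelihood):
    """Bayes rule for a discrete hypothesis x:
    p(x|t) = p(t|x) p(x) / sum_x' p(t|x') p(x')."""
    evidence = sum(likelihood[x] * prior[x] for x in prior)
    return {x: likelihood[x] * prior[x] / evidence for x in prior}

# Prior from the text: 0.15% of the population is infected.
prior = {"HIV+": 0.0015, "HIV-": 0.9985}
# p(T = HIV+ | x): perfect detection, 1% false positives.
positive_test = {"HIV+": 1.0, "HIV-": 0.01}

post = posterior(prior, positive_test)
print(round(post["HIV+"], 4))
```

The denominator `evidence` is exactly the marginal p(T = HIV+) obtained by summing over both health states.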
Fig. 1.13. A graphical description of our HIV testing scenario. Knowing the age of
the patient influences our prior on whether the patient is HIV positive (the random
variable X). The outcomes of the tests 1 and 2 are independent of each other given
the status X. We observe the shaded random variables (age, test 1, test 2) and
would like to infer the un-shaded random variable X. This is a special case of a
graphical model which we will discuss in Chapter ??.


Let us now think how we could improve the diagnosis. One way is to obtain
further information about the patient and to use this in the diagnosis.
For instance, information about his age is quite useful. Suppose the patient
is 35 years old. In this case we would want to compute p(X = HIV+|T =
HIV+, A = 35) where the random variable A denotes the age. The corresponding
expression yields:

$$p(X = \text{HIV+} \mid T = \text{HIV+}, A) = \frac{p(T = \text{HIV+} \mid X = \text{HIV+}, A)\,p(X = \text{HIV+} \mid A)}{p(T = \text{HIV+} \mid A)}.$$


Here we simply conditioned all random variables on A in order to take additional
information into account. We may assume that, given the health status of the
patient, the test is independent of the age of the patient, i.e.

$$p(t|x, a) = p(t|x).$$


What remains therefore is p(X = HIV+|A). Recent US census data pegs this
number at approximately 0.9%. Plugging all data back into the conditional
expression yields $\frac{1 \cdot 0.009}{1 \cdot 0.009 + 0.01 \cdot 0.991} = 0.48$. What has happened here is that
by including additional observed random variables our estimate has become
more reliable. Combination of evidence is a powerful tool. In our case it
helped us make the classification problem of whether the patient is HIV-positive
or not more reliable.

A second tool in our arsenal is the use of multiple measurements. After
the first test the physician is likely to carry out a second test to confirm the
diagnosis. We denote by $T_1$ and $T_2$ (and $t_1, t_2$ respectively) the two tests.
Obviously, what we want is that $T_2$ will give us an “independent” second
opinion of the situation. In other words, we want to ensure that $T_2$ does
not make the same mistakes as $T_1$. For instance, it is probably a bad idea
to repeat $T_1$ without changes, since it might perform the same diagnostic
mistake as before. What we want is that the diagnosis of $T_2$ is independent
of that of $T_1$ given the health status X of the patient. This is expressed as

$$p(t_1, t_2|x) = p(t_1|x)\,p(t_2|x). \qquad (1.16)$$


See Figure 1.13 for a graphical illustration of the setting. Random variables
satisfying the condition (1.16) are commonly referred to as conditionally
independent. In shorthand we write $T_1 \perp\!\!\!\perp T_2 \mid X$. For the sake of the argument
we assume that the statistics for $T_2$ are given by

p(t2|x)        x = HIV-    x = HIV+
t2 = HIV-      0.95        0.01
t2 = HIV+      0.05        0.99


Clearly this test is less reliable than the first one. However, we may now
combine both estimates to obtain a very reliable estimate based on the
combination of both events. For instance, for $t_1 = t_2 = \text{HIV+}$ we have

$$p(X = \text{HIV+} \mid T_1 = \text{HIV+}, T_2 = \text{HIV+}) = \frac{1.0 \cdot 0.99 \cdot 0.009}{1.0 \cdot 0.99 \cdot 0.009 + 0.01 \cdot 0.05 \cdot 0.991} = 0.95.$$


In other words, by combining two tests we can now confirm with very high
confidence that the patient is indeed diseased. What we have carried out is a
combination of evidence. Strong experimental evidence of two positive tests
effectively overcame an initially very strong prior which suggested that the
patient might be healthy.
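Under the conditional independence assumption (1.16), combining evidence amounts to multiplying the prior by one likelihood term per test and renormalizing. The sketch below reproduces the two numbers from the example, using the age-specific prior of 0.9% and the test statistics given above; the function name `update` is illustrative.

```python
def update(prior, *likelihoods):
    """Posterior over x after observing several tests that are
    conditionally independent given x, cf. (1.16)."""
    unnorm = dict(prior)
    for lik in likelihoods:
        for x in unnorm:
            unnorm[x] *= lik[x]
    z = sum(unnorm.values())
    return {x: v / z for x, v in unnorm.items()}

prior_age_35 = {"HIV+": 0.009, "HIV-": 0.991}   # p(x | A = 35)
test1_positive = {"HIV+": 1.0, "HIV-": 0.01}    # p(T1 = HIV+ | x)
test2_positive = {"HIV+": 0.99, "HIV-": 0.05}   # p(T2 = HIV+ | x)

after_one = update(prior_age_35, test1_positive)
after_two = update(prior_age_35, test1_positive, test2_positive)
print(after_one["HIV+"], after_two["HIV+"])
```

If the two tests were not conditionally independent, the simple product of likelihoods would overstate the evidence and a joint model p(t1, t2|x) would be needed instead.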

Tests such as in the example we just discussed are fairly common. For 
instance, we might need to decide which manufacturing procedure is prefer- 
able, which choice of parameters will give better results in a regression es- 
timator, or whether to administer a certain drug. Note that often our tests 
may not be conditionally independent and we would need to take this into 
account. 


1.3 Basic Algorithms 


We conclude our introduction to machine learning by discussing four simple 
algorithms, namely Naive Bayes, Nearest Neighbors, the Mean Classifier, 
and the Perceptron, which can be used to solve a binary classification prob- 
lem such as that described in Figure 1.5. We will also introduce the K-means 
algorithm which can be employed when labeled data is not available. All 
these algorithms are readily usable and easily implemented from scratch in 
their most basic form. 

For the sake of concreteness assume that we are interested in spam filtering.
That is, we are given a set of m e-mails $x_i$, denoted by $X := \{x_1, \ldots, x_m\}$
and associated labels $y_i$, denoted by $Y := \{y_1, \ldots, y_m\}$. Here the labels satisfy
$y_i \in \{\text{spam}, \text{ham}\}$. The key assumption we make here is that the pairs
$(x_i, y_i)$ are drawn jointly from some distribution p(x, y) which represents
the e-mail generating process for a user. Moreover, we assume that there
is sufficiently strong dependence between x and y that we will be able to
estimate y given x and a set of labeled instances X, Y.

From: "LucindaParkison497072" <LucindaParkison497072@hotmail.com>
To: <kargr@earthlink.net>
Subject: we think ACGU is our next winner
Date: Mon, 25 Feb 2008 00:01:01 -0500
MIME-Version: 1.0
X-OriginalArrivalTime: 25 Feb 2008 05:01:01.0329 (UTC) FILETIME=[6A931810:01C8776B]
Return-Path: lucindaparkison497072@hotmail.com

(ACGU) .045 UP 104.5%

I do think that (ACGU) at it’s current levels looks extremely attractive.

Asset Capital Group, Inc., (ACGU) announced that it is expanding the marketing of bioremediation fluids and cleaning equipment. After
its recent acquisition of interest in American Bio-Clean Corporation and an 80

News is expected to be released next week on this growing company and could drive the price even higher. Buy (ACGU) Monday at open. I
believe those involved at this stage could enjoy a nice ride up.

Fig. 1.14. Example of a spam e-mail


x1: The quick brown fox jumped over the lazy dog.
x2: The dog hunts a fox.

      the  quick  brown  fox  jumped  over  lazy  dog  hunts  a
x1     2     1      1     1     1      1     1     1     0    0
x2     1     0      0     1     0      0     0     1     1    1

Fig. 1.15. Vector space representation of strings.


Before we do so we need to address the fact that e-mails such as Figure 1.14
are text, whereas the algorithms we present will require data to be
represented in a vectorial fashion. One way of converting text into a vector
is by using the so-called bag of words representation [Mar61, Lew98]. In its
simplest version it works as follows: Assume we have a list of all possible
words occurring in X, that is a dictionary, then we are able to assign a unique
number to each of those words (e.g. the position in the dictionary). Now
we may simply count for each document $x_i$ the number of times a given
word j occurs. This is then used as the value of the j-th coordinate
of $x_i$. Figure 1.15 gives an example of such a representation. Once we have
the latter it is easy to compute distances, similarities, and other statistics
directly from the vectorial representation.
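The bag of words construction can be sketched in a few lines. The tokenizer below (lower-casing and splitting on non-letters) and the choice to number words by first appearance are assumptions made for illustration; real systems typically use more careful tokenization.

```python
from collections import Counter

def tokenize(text):
    """Lower-case and split on non-letters (a deliberately crude tokenizer)."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return cleaned.split()

def build_dictionary(documents):
    """Map each distinct word to a coordinate index, in order of first appearance."""
    index = {}
    for doc in documents:
        for word in tokenize(doc):
            index.setdefault(word, len(index))
    return index

def bag_of_words(text, index):
    """Count vector: coordinate j holds the number of occurrences of word j."""
    counts = Counter(tokenize(text))
    vec = [0] * len(index)
    for word, c in counts.items():
        if word in index:          # words outside the dictionary are ignored
            vec[index[word]] = c
    return vec

docs = ["The quick brown fox jumped over the lazy dog.", "The dog hunts a fox."]
index = build_dictionary(docs)
x1, x2 = (bag_of_words(d, index) for d in docs)
print(list(index))
print(x1)
print(x2)
```

On the two sentences of Figure 1.15 this yields the same two count vectors as shown in the figure.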
1.3.1 Naive Bayes


In the example of the AIDS test we used the outcomes of the test to infer
whether the patient is diseased. In the context of spam filtering the actual
text of the e-mail x corresponds to the test and the label y is equivalent to
the diagnosis. Recall Bayes Rule (1.15). We could use the latter to infer

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}.$$

We may have a good estimate of p(y), that is, the probability of receiving
a spam or ham mail. Denote by $m_{\text{ham}}$ and $m_{\text{spam}}$ the number of ham and
spam e-mails in X. In this case we can estimate

$$p(\text{ham}) \approx \frac{m_{\text{ham}}}{m} \quad \text{and} \quad p(\text{spam}) \approx \frac{m_{\text{spam}}}{m}.$$

The key problem, however, is that we do not know p(x|y) or p(x). We may
dispose of the requirement of knowing p(x) by settling for a likelihood ratio

$$L(x) := \frac{p(\text{spam}|x)}{p(\text{ham}|x)} = \frac{p(x|\text{spam})\,p(\text{spam})}{p(x|\text{ham})\,p(\text{ham})}. \qquad (1.17)$$


Whenever L(x) exceeds a given threshold c we decide that x is spam and
consequently reject the e-mail. If c is large then our algorithm is conservative
and classifies an email as spam only if $p(\text{spam}|x) \gg p(\text{ham}|x)$. On the other
hand, if c is small then the algorithm aggressively classifies emails as spam.

The key obstacle is that we have no access to p(x|y). This is where we make
our key approximation. Recall Figure 1.13. In order to model the distribution
of the test outcomes $T_1$ and $T_2$ we made the assumption that they are
conditionally independent of each other given the diagnosis. Analogously,
we may now treat the occurrence of each word in a document as a separate
test and combine the outcomes in a naive fashion by assuming that

$$p(x|y) = \prod_{j=1}^{\#\text{ of words in } x} p(w^j|y), \qquad (1.18)$$

where $w^j$ denotes the j-th word in document x. This amounts to the assumption
that the probability of occurrence of a word in a document is
independent of all other words given the category of the document. Even
though this assumption does not hold in general (for instance, the word
“York” is much more likely to occur after the word “New”), it suffices for our
purposes (see Figure 1.16).

Fig. 1.16. Naive Bayes model. The occurrences of individual words are independent
of each other, given the category of the text. For instance, the word Viagra is fairly
frequent if y = spam but it is considerably less frequent if y = ham, except when
considering the mailbox of a Pfizer sales representative.

This assumption reduces the difficulty of knowing p(x|y) to that of estimating
the probabilities of occurrence of individual words w. Estimates for
p(w|y) can be obtained, for instance, by simply counting the frequency of
occurrence of the word within documents of a given class. That is, we estimate

$$p(w|\text{spam}) \approx \frac{\sum_{i=1}^{m} \sum_{j=1}^{\#\text{ of words in } x_i} \{y_i = \text{spam and } w_i^j = w\}}{\sum_{i=1}^{m} \sum_{j=1}^{\#\text{ of words in } x_i} \{y_i = \text{spam}\}}.$$

Here $\{y_i = \text{spam and } w_i^j = w\}$ equals 1 if and only if $x_i$ is labeled as spam
and w occurs as the j-th word in $x_i$. The denominator is simply the total
number of words in spam documents. Similarly one can compute p(w|ham).
In principle we could perform the above summation whenever we see a new
document x. This would be terribly inefficient, since each such computation
requires a full pass through X and Y. Instead, we can perform a single pass
through X and Y and store the resulting statistics as a good estimate of the
conditional probabilities. Algorithm 1.1 has details of an implementation.
Note that we performed a number of optimizations: Firstly, the normalization
by $m_{\text{spam}}^{-1}$ and $m_{\text{ham}}^{-1}$ respectively is independent of x, hence we
incorporate it as a fixed offset. Secondly, since we are computing a product over
a large number of factors the numbers might lead to numerical overflow or
underflow. This can be addressed by summing over the logarithm of terms
rather than computing products. Thirdly, we need to address the issue of
estimating p(w|y) for words w which we might not have seen before. One
way of dealing with this is to increment all counts by 1. This method is
commonly referred to as Laplace smoothing. We will encounter a theoretical
justification for this heuristic in Section 2.3.

This simple algorithm is known to perform surprisingly well, and variants 
of it can be found in most modern spam filters. It amounts to what is 
commonly known as “Bayesian spam filtering”. Obviously, we may apply it 
to problems other than document categorization, too. 
Algorithm 1.1 Naive Bayes
Train(X, Y) {reads documents X and labels Y}
  Compute dictionary D of X with n words.
  Compute $m$, $m_{\text{ham}}$ and $m_{\text{spam}}$.
  Initialize $b := \log c + \log m_{\text{ham}} - \log m_{\text{spam}}$ to offset the rejection threshold
  Initialize $p \in \mathbb{R}^{2 \times n}$ with $p_{ij} = 1$, $w_{\text{spam}} = n$, $w_{\text{ham}} = n$.
  {Count occurrence of each word}
  {Here $x_i^j$ denotes the number of times word j occurs in document $x_i$}
  for i = 1 to m do
    if $y_i$ = spam then
      for j = 1 to n do
        $p_{0,j} \leftarrow p_{0,j} + x_i^j$
        $w_{\text{spam}} \leftarrow w_{\text{spam}} + x_i^j$
      end for
    else
      for j = 1 to n do
        $p_{1,j} \leftarrow p_{1,j} + x_i^j$
        $w_{\text{ham}} \leftarrow w_{\text{ham}} + x_i^j$
      end for
    end if
  end for
  {Normalize counts to yield word probabilities}
  for j = 1 to n do
    $p_{0,j} \leftarrow p_{0,j} / w_{\text{spam}}$
    $p_{1,j} \leftarrow p_{1,j} / w_{\text{ham}}$
  end for
Classify(x) {classifies document x}
  Initialize score threshold $t = -b$
  for j = 1 to n do
    $t \leftarrow t + x^j (\log p_{0,j} - \log p_{1,j})$
  end for
  if t > 0 return spam else return ham
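Algorithm 1.1 translates almost line by line into Python. The following is a minimal sketch, assuming documents are already given as bag-of-words count vectors of equal length and that both classes occur in the training set; the four-word vocabulary and the toy data are made up for illustration.

```python
import math

def train(X, Y, c=1.0):
    """X: list of count vectors (length n), Y: list of 'spam'/'ham' labels.
    Returns log word probabilities per class and the threshold offset b."""
    n = len(X[0])
    m_spam = sum(1 for y in Y if y == "spam")
    m_ham = len(Y) - m_spam
    b = math.log(c) + math.log(m_ham) - math.log(m_spam)
    # Laplace smoothing: every count starts at 1, so the totals start at n.
    counts = {"spam": [1.0] * n, "ham": [1.0] * n}
    totals = {"spam": float(n), "ham": float(n)}
    for x, y in zip(X, Y):
        for j, xj in enumerate(x):
            counts[y][j] += xj
            totals[y] += xj
    log_p = {y: [math.log(cnt / totals[y]) for cnt in counts[y]] for y in counts}
    return log_p, b

def classify(x, model):
    """Sum log-likelihood ratios of the words in x and compare with the offset."""
    log_p, b = model
    t = -b
    for j, xj in enumerate(x):
        t += xj * (log_p["spam"][j] - log_p["ham"][j])
    return "spam" if t > 0 else "ham"

# Hypothetical vocabulary: [buy, cheap, meeting, report]
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
Y = ["spam", "spam", "ham", "ham"]
model = train(X, Y)
print(classify([1, 1, 0, 0], model), classify([0, 0, 1, 1], model))
```

Working with logarithms keeps the score a plain sum, so long documents do not cause underflow, and raising `c` makes the filter more conservative about declaring spam.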


1.3.2 Nearest Neighbor Estimators 


An even simpler estimator than Naive Bayes is nearest neighbors. In its most
basic form it assigns the label of its nearest neighbor to an observation x
(see Figure 1.17). Hence, all we need to implement it is a distance measure
d(x, x’) between pairs of observations. Note that this distance need not even
be symmetric. This means that nearest neighbor classifiers can be extremely
flexible. For instance, we could use string edit distances to compare two
documents or information theory based measures.

Fig. 1.17. 1 nearest neighbor classifier. Depending on whether the query point x is
closest to the star, diamond or triangles, it uses one of the three labels for it.

Fig. 1.18. k-Nearest neighbor classifiers using Euclidean distances. Left: decision
boundaries obtained from a 1-nearest neighbor classifier. Middle: color-coded sets
of where the number of red / blue points ranges between 7 and 0. Right: decision
boundary determining where the blue or red dots are in the majority.

However, the problem with nearest neighbor classification is that the esti- 
mates can be very noisy whenever the data itself is very noisy. For instance, 
if a spam email is erroneously labeled as nonspam then all emails which 
are similar to this email will share the same fate. See Figure 1.18 for an 
example. In this case it is beneficial to pool together a number of neighbors, 
say the k-nearest neighbors of x and use a majority vote to decide the class 
membership of x. Algorithm 1.2 has a description of the algorithm. Note 
that nearest neighbor algorithms can yield excellent performance when used 
with a good distance measure. For instance, the technology underlying the 
Netflix progress prize [BIX07] was essentially nearest neighbours based. 

Note that it is trivial to extend the algorithm to regression. All we need 
to change in Algorithm 1.2 is to return the average of the values y; instead 
of their majority vote. Figure 1.19 has an example. 
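The k-nearest neighbor rule of Algorithm 1.2, together with its regression variant, can be sketched as follows. The Euclidean distance is only a default here; any function `dist(a, b)` may be passed in. The toy data, including one deliberately mislabeled point, are made up for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(X, x, k, dist=euclidean):
    """Indices of the k training points closest to the query x."""
    order = sorted(range(len(X)), key=lambda i: dist(X[i], x))
    return order[:k]

def knn_classify(X, Y, x, k=3, dist=euclidean):
    """Majority vote among the labels of the k nearest neighbors."""
    votes = Counter(Y[i] for i in k_nearest(X, x, k, dist))
    return votes.most_common(1)[0][0]

def knn_regress(X, Y, x, k=3, dist=euclidean):
    """Average of the target values of the k nearest neighbors."""
    idx = k_nearest(X, x, k, dist)
    return sum(Y[i] for i in idx) / len(idx)

# The fourth point sits among the ham e-mails but carries the wrong label.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.5, 0.5), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
Y = ["ham", "ham", "ham", "spam", "spam", "spam", "spam"]
query = (0.6, 0.6)
label_1nn = knn_classify(X, Y, query, k=1)
label_3nn = knn_classify(X, Y, query, k=3)

Xr = [(0.0,), (1.0,), (2.0,), (3.0,)]
Yr = [0.0, 1.0, 2.0, 3.0]
estimate = knn_regress(Xr, Yr, (1.4,), k=2)
print(label_1nn, label_3nn, estimate)
```

With k = 1 the query inherits the erroneous label of its single nearest neighbor, whereas the majority vote over three neighbors outvotes it.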

Algorithm 1.2 k-Nearest Neighbor Classification
Classify(X, Y, x) {reads documents X, labels Y and query x}
  for i = 1 to m do
    Compute distance $d(x_i, x)$
  end for
  Compute set I containing indices for the k smallest distances $d(x_i, x)$.
  return majority label of {$y_i$ where $i \in I$}.


Fig. 1.19. k-Nearest neighbor regression estimator using Euclidean distances. Left:
some points (x, y) drawn from a joint distribution. Middle: 1-nearest neighbour
classifier. Right: 7-nearest neighbour classifier. Note that the regression estimate is
much smoother.


Note that the distance computation $d(x_i, x)$ for all observations can become
extremely costly, in particular whenever the number of observations is
large or whenever the observations $x_i$ live in a very high dimensional space.


Random projections are a technique that can alleviate the high computational
cost of Nearest Neighbor classifiers. A celebrated lemma by Johnson
and Lindenstrauss [DG03] asserts that a set of m points in high dimensional
Euclidean space can be projected into a $O(\log m/\epsilon^2)$ dimensional Euclidean
space such that the distance between any two points changes only by a factor
of $(1 \pm \epsilon)$. Since Euclidean distances are approximately preserved, running
the Nearest Neighbor classifier on this mapped data yields nearly the same
results but at a lower computational cost [GIM99].


The surprising fact is that the projection relies on a simple randomized
algorithm: to obtain a d-dimensional representation of n-dimensional random
observations we pick a matrix $R \in \mathbb{R}^{d \times n}$ where each element is drawn
independently from a normal distribution with n-2 variance and zero mean. 
Multiplying x with this projection matrix can be shown to achieve this prop- 
erty with high probability. For details see [DG03]. 
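The effect of a Gaussian random projection on pairwise distances can be checked empirically. The sketch below is a rough illustration and not the construction analysed in [DG03]: it assumes entries drawn from a normal distribution with zero mean and variance 1/d, a normalization chosen here so that squared lengths are preserved in expectation. A different common scaling of R multiplies all distances by the same factor and therefore leaves the ranking of nearest neighbors unchanged. The dimensions and the random test points are arbitrary.

```python
import math
import random

def random_projection_matrix(d, n, rng):
    """d x n matrix with i.i.d. N(0, 1/d) entries (an assumed normalization)."""
    sigma = 1.0 / math.sqrt(d)
    return [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(d)]

def project(R, x):
    """Compute the matrix-vector product R x."""
    return [sum(r_j * x_j for r_j, x_j in zip(row, x)) for row in R]

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

rng = random.Random(0)
n, d = 1000, 300
points = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(4)]
R = random_projection_matrix(d, n, rng)
projected = [project(R, x) for x in points]

# Ratio of projected to original distance for every pair of points.
ratios = [distance(projected[i], projected[j]) / distance(points[i], points[j])
          for i in range(4) for j in range(i + 1, 4)]
print(ratios)
```

The ratios cluster around 1, and their spread shrinks as the target dimension d grows.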
Fig. 1.20. A trivial classifier. Classification is carried out in accordance to which of
the two means $\mu_-$ or $\mu_+$ is closer to the test point x. Note that the sets of positive
and negative labels respectively form a half space.


1.3.3 A Simple Classifier 


We can use geometry to design another simple classification algorithm [SS02]
for our problem. For simplicity we assume that the observations $x \in \mathbb{R}^d$, such
as the bag-of-words representation of e-mails. We define the means $\mu_+$ and
$\mu_-$ to correspond to the classes $y \in \{\pm 1\}$ via

$$\mu_- := \frac{1}{m_-} \sum_{y_i = -1} x_i \quad \text{and} \quad \mu_+ := \frac{1}{m_+} \sum_{y_i = +1} x_i.$$

Here we used $m_-$ and $m_+$ to denote the number of observations with label
$y_i = -1$ and $y_i = +1$ respectively. An even simpler approach than using the
nearest neighbor classifier would be to use the class label which corresponds
to the mean closest to a new query x, as described in Figure 1.20.
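The nearest-mean rule needs nothing beyond two averages and a distance comparison. The following sketch implements it directly; the function names and the four training points are illustrative.

```python
def class_mean(X, Y, label):
    """Coordinate-wise mean of all observations carrying the given label."""
    members = [x for x, y in zip(X, Y) if y == label]
    return [sum(col) / len(members) for col in zip(*members)]

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def fit_means(X, Y):
    """Returns (mu_plus, mu_minus)."""
    return class_mean(X, Y, +1), class_mean(X, Y, -1)

def predict(means, x):
    """Label of the class mean closest to x."""
    mu_pos, mu_neg = means
    return +1 if sq_dist(x, mu_pos) < sq_dist(x, mu_neg) else -1

X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
Y = [+1, +1, -1, -1]
means = fit_means(X, Y)
print(means, predict(means, (2.0, 1.0)), predict(means, (-1.0, -1.0)))
```

Training costs a single pass over the data, and prediction costs two distance evaluations regardless of the size of the training set.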

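For Euclidean distances we have

$$\|\mu_- - x\|^2 = \|\mu_-\|^2 + \|x\|^2 - 2\langle \mu_-, x \rangle \quad \text{and} \qquad (1.19)$$
$$\|\mu_+ - x\|^2 = \|\mu_+\|^2 + \|x\|^2 - 2\langle \mu_+, x \rangle. \qquad (1.20)$$

Here $\langle \cdot, \cdot \rangle$ denotes the standard dot product between vectors. Taking half the
difference between the two squared distances yields

$$f(x) := \tfrac{1}{2}\left(\|\mu_- - x\|^2 - \|\mu_+ - x\|^2\right) = \langle \mu_+ - \mu_-, x \rangle + \tfrac{1}{2}\left(\|\mu_-\|^2 - \|\mu_+\|^2\right), \qquad (1.21)$$

so that f(x) > 0 exactly when x is closer to $\mu_+$ than to $\mu_-$.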


This is a linear function in x and its sign corresponds to the labels we estimate
for x. Our algorithm sports an important property: The classification
rule can be expressed via dot products. This follows from

$$\|\mu_+\|^2 = \langle \mu_+, \mu_+ \rangle = m_+^{-2} \sum_{y_i = y_j = 1} \langle x_i, x_j \rangle \quad \text{and} \quad \langle \mu_+, x \rangle = m_+^{-1} \sum_{y_i = 1} \langle x_i, x \rangle.$$

Analogous expressions can be computed for $\mu_-$. Consequently we may express
the classification rule (1.21) as

$$f(x) = \sum_{i=1}^{m} \alpha_i \langle x_i, x \rangle + b \qquad (1.22)$$

where $\alpha_i = y_i / m_{y_i}$ and $b = \tfrac{1}{2}\left(\|\mu_-\|^2 - \|\mu_+\|^2\right) = \tfrac{1}{2}\left(m_-^{-2} \sum_{y_i = y_j = -1} \langle x_i, x_j \rangle - m_+^{-2} \sum_{y_i = y_j = 1} \langle x_i, x_j \rangle\right)$.

Fig. 1.21. The feature map $\phi$ maps observations x from $\mathcal{X}$ into a feature space $\mathcal{H}$.
The map $\phi$ is a convenient way of encoding pre-processing steps systematically.

This offers a number of interesting extensions. Recall that when dealing
with documents we needed to perform pre-processing to map e-mails into a
vector space. In general, we may pick arbitrary maps $\phi : \mathcal{X} \to \mathcal{H}$ mapping
the space of observations into a feature space $\mathcal{H}$, as long as the latter is
endowed with a dot product (see Figure 1.21). This means that instead of
dealing with $\langle x, x' \rangle$ we will be dealing with $\langle \phi(x), \phi(x') \rangle$.

As we will see in Chapter 6, whenever $\mathcal{H}$ is a so-called Reproducing Kernel
Hilbert Space, the inner product can be abbreviated in the form of a kernel
function k(x, x’) which satisfies

$$k(x, x') := \langle \phi(x), \phi(x') \rangle. \qquad (1.23)$$

This small modification leads to a number of very powerful algorithms and
it is at the foundation of an area of research called kernel methods. We
will encounter a number of such algorithms for regression, classification,
segmentation, and density estimation over the course of the book. Examples
of suitable k are the polynomial kernel $k(x, x') = \langle x, x' \rangle^d$ for $d \in \mathbb{N}$ and the
Gaussian RBF kernel $k(x, x') = e^{-\gamma \|x - x'\|^2}$ for $\gamma > 0$.

The upshot of (1.23) is that our basic algorithm can be kernelized. That
is, we may rewrite (1.21) as

$$f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x) + b \qquad (1.24)$$

where as before $\alpha_i = y_i / m_{y_i}$ and the offset b is computed analogously. As
a consequence we have now moved from a fairly simple and pedestrian linear
classifier to one which yields a nonlinear function f(x) with a rather
nontrivial decision boundary.


Algorithm 1.3 The Perceptron
Perceptron(X, Y) {reads stream of observations $(x_i, y_i)$}
  Initialize w = 0 and b = 0
  while there exists some $(x_i, y_i)$ with $y_i(\langle w, x_i \rangle + b) \leq 0$ do
    $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
  end while


Algorithm 1.4 The Kernel Perceptron
KernelPerceptron(X, Y) {reads stream of observations $(x_i, y_i)$}
  Initialize f = 0
  while there exists some $(x_i, y_i)$ with $y_i f(x_i) \leq 0$ do
    $f \leftarrow f + y_i k(x_i, \cdot) + y_i$
  end while
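To see what kernelizing the mean classifier (1.24) buys, the sketch below evaluates it with three kernels on an XOR-style dataset in which both class means coincide at the origin. It assumes coefficients $\alpha_i = y_i/m_{y_i}$ and an offset equal to half the difference of the squared feature-space norms of the two class means, so that f(x) > 0 exactly when $\phi(x)$ is closer to the positive mean. The dataset and function names are made up for illustration.

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2):
    return linear_kernel(x, z) ** degree

def rbf_kernel(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def kernel_mean_classifier(X, Y, k):
    """Returns f with f(x) = sum_i alpha_i k(x_i, x) + b."""
    pos = [x for x, y in zip(X, Y) if y == +1]
    neg = [x for x, y in zip(X, Y) if y == -1]

    def mean_sq_norm(S):
        # Squared norm of the class mean in feature space:
        # the average of k over all pairs of points in S.
        return sum(k(a, c) for a in S for c in S) / len(S) ** 2

    alpha = [y / (len(pos) if y == +1 else len(neg)) for y in Y]
    b = 0.5 * (mean_sq_norm(neg) - mean_sq_norm(pos))

    def f(x):
        return sum(a * k(x_i, x) for a, x_i in zip(alpha, X)) + b
    return f

# XOR-like data: both class means are the origin in input space.
X = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
Y = [+1, +1, -1, -1]
f_lin = kernel_mean_classifier(X, Y, linear_kernel)
f_poly = kernel_mean_classifier(X, Y, polynomial_kernel)
f_rbf = kernel_mean_classifier(X, Y, rbf_kernel)
print(f_lin((0.9, 0.9)), f_poly((0.9, 0.9)), f_rbf((0.9, 0.9)))
```

With the linear kernel the score is zero everywhere, since the two means are identical; with the quadratic or Gaussian kernel the same code separates the classes.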


1.3.4 Perceptron 


In the previous sections we assumed that our classifier had access to a train- 
ing set of spam and non-spam emails. In real life, such a set might be difficult 
to obtain all at once. Instead, a user might want to have instant results when- 
ever a new e-mail arrives and he would like the system to learn immediately 
from any corrections to mistakes the system makes. 

To overcome both these difficulties one could envisage working with the 
following protocol: As emails arrive our algorithm classifies them as spam or 
non-spam, and the user provides feedback as to whether the classification is 
correct or incorrect. This feedback is then used to improve the performance 
of the classifier over a period of time. 

This intuition can be formalized as follows: Our classifier maintains a
parameter vector. At the t-th time instance it receives a data point $x_t$, to
which it assigns a label $\hat{y}_t$ using its current parameter vector. The true label
$y_t$ is then revealed, and used to update the parameter vector of the classifier.
Such algorithms are said to be online. We will now describe perhaps the
simplest classifier of this kind, namely the Perceptron [Heb49, Ros58].

Let us assume that the data points $x_t \in \mathbb{R}^d$, and labels $y_t \in \{\pm 1\}$. As
before we represent an email as a bag-of-words vector and we assign +1 to
spam emails and −1 to non-spam emails. The Perceptron maintains a weight
vector $w \in \mathbb{R}^d$ and classifies $x_t$ according to the rule

$$\hat{y}_t := \mathrm{sign}\{\langle w, x_t \rangle + b\}, \qquad (1.25)$$

where $\langle w, x_t \rangle$ denotes the usual Euclidean dot product and b is an offset. Note
the similarity of (1.25) to (1.21) of the simple classifier. Just as the latter,
the Perceptron is a linear classifier which separates its domain $\mathbb{R}^d$ into two
halfspaces, namely $\{x \mid \langle w, x \rangle + b > 0\}$ and its complement. If $\hat{y}_t = y_t$ then
no updates are made. On the other hand, if $\hat{y}_t \neq y_t$ the weight vector is
updated as

$$w \leftarrow w + y_t x_t \quad \text{and} \quad b \leftarrow b + y_t. \qquad (1.26)$$

Fig. 1.22. The Perceptron without bias. Left: at time t we have a weight vector $w_t$
denoted by the dashed arrow with corresponding separating plane (also dashed).
For reference we include the linear separator $w^*$ and its separating plane (both
denoted by a solid line). As a new observation $x_t$ arrives which happens to be
mis-classified by the current weight vector $w_t$ we perform an update. Also note the
margin between the point $x_t$ and the separating hyperplane defined by $w^*$. Right:
This leads to the weight vector $w_{t+1}$ which is more aligned with $w^*$.


Figure 1.22 shows an update step of the Perceptron algorithm. For simplicity 
we illustrate the case without bias, that is, where b = 0 and where it remains 
unchanged. A detailed description of the algorithm is given in Algorithm 1.3. 

An important property of the algorithm is that it performs updates on w
by multiples of the observations $x_i$ on which it makes a mistake. Hence we
may express w as $w = \sum_{i \in \text{Error}} y_i x_i$. Just as before, we can replace $x_i$ and x
by $\phi(x_i)$ and $\phi(x)$ to obtain a kernelized version of the Perceptron algorithm
[FS99] (Algorithm 1.4).
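Because w is a sum over the mistakes, the kernelized Perceptron only needs to remember the observations on which it erred. The sketch below follows Algorithm 1.4 in this dual form; the cap `max_epochs` on the number of passes is an added safeguard for data that are not separable in feature space, and the quadratic kernel and XOR-style data are chosen for illustration.

```python
def quadratic_kernel(x, z):
    return sum(a * b for a, b in zip(x, z)) ** 2

def kernel_perceptron(X, Y, k, max_epochs=100):
    """Returns the stored mistakes [(x_i, y_i), ...] and the function f."""
    mistakes = []

    def f(x):
        # f = sum over mistakes of y_i * (k(x_i, x) + 1);
        # the constant 1 plays the role of the offset update b <- b + y_i.
        return sum(y_i * (k(x_i, x) + 1.0) for x_i, y_i in mistakes)

    for _ in range(max_epochs):
        updated = False
        for x, y in zip(X, Y):
            if y * f(x) <= 0:            # misclassified (or undecided): update
                mistakes.append((x, y))
                updated = True
        if not updated:                   # a full pass without mistakes
            break
    return mistakes, f

# XOR-like data, not linearly separable in the input space.
X = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
Y = [+1, +1, -1, -1]
mistakes, f = kernel_perceptron(X, Y, quadratic_kernel)
print(len(mistakes), [f(x) for x in X])
```

Passing a plain dot product as `k` recovers the linear Perceptron of Algorithm 1.3, which would keep cycling on this dataset until `max_epochs` is reached.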

If the dataset (X, Y) is linearly separable, then the Perceptron algorithm
eventually converges and correctly classifies all the points in X. The rate of
convergence, however, depends on the margin. Roughly speaking, the margin
quantifies how linearly separable a dataset is, and hence how easy it is to
solve a given classification problem.


Definition 1.6 (Margin) Let $w \in \mathbb{R}^d$ be a weight vector and let $b \in \mathbb{R}$ be
an offset. The margin of an observation $x \in \mathbb{R}^d$ with associated label y is

$$\gamma(x, y) := y\,(\langle w, x \rangle + b). \qquad (1.27)$$

Moreover, the margin of an entire set of observations X with labels Y is

$$\gamma(X, Y) := \min_i \gamma(x_i, y_i). \qquad (1.28)$$


Geometrically speaking (see Figure 1.22), for $\|w\| = 1$ the margin measures the
signed distance of x from the hyperplane defined by $\{x \mid \langle w, x \rangle + b = 0\}$. The
larger the margin, the better separated the data and hence the easier it is to
find a hyperplane which correctly classifies the dataset. The following theorem
asserts that if there exists a linear classifier which can classify a dataset with a
large margin, then the Perceptron will also correctly classify the same dataset
after making a small number of mistakes.


Theorem 1.7 (Novikoff’s theorem) Let (X, Y) be a dataset with at least
one example labeled +1 and one example labeled −1. Let $R := \max_t \|x_t\|$, and
assume that there exists $(w^*, b^*)$ such that $\|w^*\| = 1$ and $\gamma_t := y_t(\langle w^*, x_t \rangle + b^*) \geq \gamma$
for all t. Then, the Perceptron will make at most $\frac{(1 + R^2)(1 + (b^*)^2)}{\gamma^2}$
mistakes.


This result is remarkable since it does not depend on the dimensionality
of the problem. Instead, it only depends on the geometry of the setting,
as quantified via the margin $\gamma$ and the radius R of a ball enclosing the
observations. Interestingly, a similar bound can be shown for Support Vector
Machines [Vap95] which we will be discussing in Chapter 7.

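Proof We can safely ignore the iterations where no mistakes were made
and hence no updates were carried out. Therefore, without loss of generality
assume that the t-th update was made after seeing the t-th observation and
let $w_t$ denote the weight vector after the update. Furthermore, for simplicity
assume that the algorithm started with $w_0 = 0$ and $b_0 = 0$. By the update
equation (1.26) we have

$$\begin{aligned}
\langle w_t, w^* \rangle + b_t b^* &= \langle w_{t-1}, w^* \rangle + b_{t-1} b^* + y_t(\langle x_t, w^* \rangle + b^*) \\
&\geq \langle w_{t-1}, w^* \rangle + b_{t-1} b^* + \gamma.
\end{aligned}$$

By induction it follows that $\langle w_t, w^* \rangle + b_t b^* \geq t\gamma$. On the other hand we made
an update because $y_t(\langle x_t, w_{t-1} \rangle + b_{t-1}) \leq 0$. By using $y_t y_t = 1$,

$$\begin{aligned}
\|w_t\|^2 + b_t^2 &= \|w_{t-1}\|^2 + b_{t-1}^2 + y_t^2 \|x_t\|^2 + 1 + 2 y_t(\langle w_{t-1}, x_t \rangle + b_{t-1}) \\
&\leq \|w_{t-1}\|^2 + b_{t-1}^2 + \|x_t\|^2 + 1.
\end{aligned}$$

Since $\|x_t\|^2 \leq R^2$ we can again apply induction to conclude that $\|w_t\|^2 + b_t^2 \leq t\,[R^2 + 1]$.
Combining the upper and the lower bounds, using the Cauchy-Schwarz
inequality, and $\|w^*\| = 1$ yields

$$\begin{aligned}
t\gamma \leq \langle w_t, w^* \rangle + b_t b^* &= \left\langle \begin{bmatrix} w_t \\ b_t \end{bmatrix}, \begin{bmatrix} w^* \\ b^* \end{bmatrix} \right\rangle \\
&\leq \left\| \begin{bmatrix} w_t \\ b_t \end{bmatrix} \right\| \left\| \begin{bmatrix} w^* \\ b^* \end{bmatrix} \right\| = \sqrt{\|w_t\|^2 + b_t^2}\,\sqrt{1 + (b^*)^2} \\
&\leq \sqrt{t\,(R^2 + 1)}\,\sqrt{1 + (b^*)^2}.
\end{aligned}$$

Squaring both sides of the inequality and rearranging the terms yields an
upper bound on the number of updates and hence the number of mistakes. ∎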
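The bound of Theorem 1.7 can be checked numerically on a small separable dataset. The sketch below runs the Perceptron of Algorithm 1.3, counts its updates, and compares the count with $(1 + R^2)(1 + (b^*)^2)/\gamma^2$ for one known separator. The dataset and the choice of $(w^*, b^*)$ are made up for illustration; any unit-norm separator gives a valid (possibly looser) bound.

```python
import math

def perceptron(X, Y, max_epochs=1000):
    """Cycles through the data until a full pass makes no mistake."""
    w, b, mistakes = [0.0] * len(X[0]), 0.0, 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in zip(X, Y):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]   # update (1.26)
                b += y
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return w, b, mistakes

X = [(2.0, 1.0), (1.0, 3.0), (3.0, 2.0), (-1.0, -2.0), (-2.0, -1.0), (-3.0, -3.0)]
Y = [1, 1, 1, -1, -1, -1]

# A known separator with ||w*|| = 1 (chosen by inspection of the data).
w_star, b_star = (1 / math.sqrt(2), 1 / math.sqrt(2)), 0.0
R = max(math.hypot(*x) for x in X)
gamma = min(y * (sum(a * c for a, c in zip(w_star, x)) + b_star)
            for x, y in zip(X, Y))
bound = (1 + R ** 2) * (1 + b_star ** 2) / gamma ** 2

w, b, mistakes = perceptron(X, Y)
print(mistakes, bound)
```

The bound is typically far from tight; it guarantees finiteness and shows the dependence on R and the margin rather than predicting the exact number of updates.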


The Perceptron was the building block of research on Neural Networks 
[Hay98, Bis95]. The key insight was to combine large numbers of such net- 
works, often in a cascading fashion, to larger objects and to fashion opti- 
mization algorithms which would lead to classifiers with desirable properties. 
In this book we will take a complementary route. Instead of increasing the 
number of nodes we will investigate what happens when increasing the com- 
plexity of the feature map ¢ and its associated kernel k. The advantage of 
doing so is that we will reap the benefits from convex analysis and linear 
models, possibly at the expense of a slightly more costly function evaluation. 


1.3.5 K-Means 


All the algorithms we discussed so far are supervised, that is, they assume 
that labeled training data is available. In many applications this is too much 
to hope for; labeling may be expensive, error prone, or sometimes impossi- 
ble. For instance, it is very easy to crawl and collect every page within the 
www.purdue.edu domain, but rather time consuming to assign a topic to 
each page based on its contents. In such cases, one has to resort to unsupervised
learning. A prototypical unsupervised learning algorithm is K-means,
which is a clustering algorithm. Given $X = \{x_1, \ldots, x_m\}$ the goal of K-means
is to partition it into k clusters such that each point in a cluster is more similar
to points from its own cluster than to points from some other cluster.

Towards this end, define prototype vectors $\mu_1, \ldots, \mu_k$ and an indicator
vector $r_{ij}$ which is 1 if, and only if, $x_i$ is assigned to cluster j. To cluster our
dataset we will minimize the following distortion measure, which sums
the squared distance of each point from its prototype vector:

$$J(r, \mu) := \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{k} r_{ij} \|x_i - \mu_j\|^2, \qquad (1.29)$$

where $r = \{r_{ij}\}$, $\mu = \{\mu_j\}$, and $\|\cdot\|^2$ denotes the usual Euclidean square
norm.

Our goal is to find $r$ and $\mu$, but since it is not easy to jointly minimize
$J$ with respect to both $r$ and $\mu$, we will adopt a two-stage strategy:


Stage 1 Keep the $\mu$ fixed and determine $r$. In this case, it is easy to see
        that the minimization decomposes into $m$ independent problems.
        The solution for the $i$-th data point $x_i$ can be found by setting:
        \[ r_{ij} = 1 \text{ if } j = \operatorname{argmin}_{j'} \|x_i - \mu_{j'}\|^2, \tag{1.30} \]
        and 0 otherwise.
Stage 2 Keep the $r$ fixed and determine $\mu$. Since the $r$'s are fixed, $J$
        is a quadratic function of $\mu$. It can be minimized by setting the
        derivative with respect to $\mu_j$ to be 0:
        \[ \sum_{i=1}^{m} r_{ij} (x_i - \mu_j) = 0 \text{ for all } j. \tag{1.31} \]
        Rearranging obtains
        \[ \mu_j = \frac{\sum_i r_{ij} x_i}{\sum_i r_{ij}}. \tag{1.32} \]
        Since $\sum_i r_{ij}$ counts the number of points assigned to cluster
        $j$, we are essentially setting $\mu_j$ to be the sample mean of the
        points assigned to cluster $j$.


The algorithm stops when the cluster assignments do not change signifi- 
cantly. Detailed pseudo-code can be found in Algorithm 1.5. 

Two issues with K-Means are worth noting. First, it is sensitive to the
choice of the initial cluster centers $\mu$. A number of practical heuristics
have been developed. For instance, one could randomly choose $k$ points from
the given dataset as cluster centers. Other methods try to pick $k$ points
from $X$ which are farthest away from each other. Second, it makes a hard
assignment of every point to a cluster center. Variants which we will
encounter later in the book will relax this. Instead of letting
$r_{ij} \in \{0, 1\}$ these soft variants will replace it with the probability
that a given $x_i$ belongs to cluster $j$.


Algorithm 1.5 K-Means

Cluster(X) {Cluster dataset X}
  Initialize cluster centers $\mu_j$ for $j = 1, \ldots, k$ randomly
  repeat
    for $i = 1$ to $m$ do
      Compute $j' = \operatorname{argmin}_{j = 1, \ldots, k} d(x_i, \mu_j)$
      Set $r_{ij'} = 1$ and $r_{ij} = 0$ for all $j \ne j'$
    end for
    for $j = 1$ to $k$ do
      Compute $\mu_j = \frac{\sum_i r_{ij} x_i}{\sum_i r_{ij}}$
    end for
  until Cluster assignments $r_{ij}$ are unchanged
  return $\{\mu_1, \ldots, \mu_k\}$ and $r_{ij}$
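
The two-stage procedure of Algorithm 1.5 can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: it uses NumPy,
takes the squared Euclidean distance for $d$, initializes the centers with $k$
randomly chosen data points (one of the heuristics mentioned above), and keeps
an old center whenever a cluster becomes empty. The function name `kmeans` and
the synthetic data are hypothetical and not part of the text.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the rows of X into k groups, following Algorithm 1.5."""
    rng = np.random.default_rng(seed)
    # Initialization heuristic: pick k distinct data points as centers.
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        # Stage 1 (1.30): assign each point to its nearest prototype.
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_assign = dist.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break  # assignments unchanged, so the algorithm stops
        assign = new_assign
        # Stage 2 (1.32): move each prototype to the mean of its points.
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:  # assumption: keep the old center if a cluster is empty
                mu[j] = members.mean(axis=0)
    return mu, assign


# Two well separated blobs in the plane (synthetic data for illustration).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
mu, assign = kmeans(X, k=2)
print(np.sort(mu[:, 0]))  # first coordinates of the two centers
```

On data like this the two centers should end up close to the blob means.
Different seeds may still lead to different local minima on harder datasets,
which is the sensitivity to initialization noted above.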

The K-Means algorithm concludes our discussion of a set of basic machine 
learning methods for classification and regression. They provide a useful 
starting point for an aspiring machine learning researcher. In this book we 
will see many more such algorithms as well as connections between these 
basic algorithms and their more advanced counterparts. 


Problems 


Problem 1.1 (Eyewitness) Assume that an eyewitness is 90% certain
that a given person committed a crime in a bar. Moreover, assume that
there were 50 people in the bar at the time of the crime. What is the
posterior probability of the person actually having committed the crime?


Problem 1.2 (DNA Test) Assume the police have a DNA library of 10 
million records. Moreover, assume that the false recognition probability is 
below 0.00001% per record. Suppose a match is found after a database search 
for an individual. What are the chances that the identification is correct? You 
can assume that the total population is 100 million people. Hint: compute 
the probability of no match occurring first. 


Problem 1.3 (Bomb Threat) Suppose that the probability that one of a
thousand passengers on a plane has a bomb is 1 : 1,000,000. Assuming that
the probability to have a bomb is evenly distributed among the passengers,
the probability that two passengers have a bomb is roughly equal to $10^{-12}$.
Therefore, one might decide to take a bomb on a plane to decrease chances
that somebody else has a bomb. What is wrong with this argument?


Problem 1.4 (Monty-Hall Problem) Assume that in a TV show the 
candidate is given the choice between three doors. Behind two of the doors 
there is a pencil and behind one there is the grand prize, a car. The candi- 
date chooses one door. After that, the showmaster opens another door behind 
which there is a pencil. Should the candidate switch doors after that? What 
is the probability of winning the car? 


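Problem 1.5 (Mean and Variance for Random Variables) Denote by
$X_i$ random variables. Prove that in this case


\[ \mathbf{E}_{X_1, \ldots, X_N}\Big[\sum_i x_i\Big] = \sum_i \mathbf{E}_{X_i}[x_i]
   \quad \text{and} \quad
   \mathrm{Var}_{X_1, \ldots, X_N}\Big[\sum_i x_i\Big] = \sum_i \mathrm{Var}_{X_i}[x_i]. \]


To show the second equality assume independence of the $X_i$.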


Problem 1.6 (Two Dice) Assume you have a game which uses the maximum
of two dice. Compute the probability of seeing any of the events
$\{1, \ldots, 6\}$. Hint: prove first that the cumulative distribution
function of the maximum of a pair of random variables is the square of the
original cumulative distribution function.


Problem 1.7 (Matching Coins) Consider the following game: two players
bring a coin each. The first player bets that when tossing the coins both
will match and the second one bets that they will not match. Show that even 
if one of the players were to bring a tainted coin, the game still would be 
fair. Show that it is in the interest of each player to bring a fair coin to the 
game. Hint: assume that the second player knows that the first coin favors 
heads over tails. 


Problem 1.8 (Randomized Maximization) How many observations do
you need to draw from a distribution to ensure that the maximum over them
is larger than 95% of all observations with at least 95% probability? Hint:
generalize the result from Problem 1.6 to the maximum over $n$ random
variables.

Application: Assume we have 1000 computers performing MapReduce [DG08]
and the Reducers have to wait until all 1000 Mappers are finished with their
job. Compute the quantile of the typical time to completion.


Problem 1.9 Prove that the Normal distribution (1.3) has mean $\mu$ and
variance $\sigma^2$. Hint: exploit the fact that $p$ is symmetric around $\mu$.


Problem 1.10 (Cauchy Distribution) Prove that for the density

\[ p(x) = \frac{1}{\pi (1 + x^2)} \tag{1.33} \]

mean and variance are undefined. Hint: show that the integral diverges.


Problem 1.11 (Quantiles) Find a distribution for which the mean exceeds
the median. Hint: the mean depends on the value of the high-quantile
terms, whereas the median does not.


Problem 1.12 (Multicategory Naive Bayes) Prove that for multicategory
Naive Bayes the optimal decision is given by


\[ y^*(x) := \operatorname{argmax}_{y} \; p(y) \prod_{i=1}^{n} p(x_i \mid y) \tag{1.34} \]


where $y \in Y$ is the class label of the observation $x$.


Problem 1.13 (Bayes Optimal Decisions) Denote by
$y^*(x) = \operatorname{argmax}_y p(y|x)$ the label associated with the largest
conditional class probability. Prove that for $y^*(x)$ the probability of
choosing the wrong label $y$ is given by


\[ l(x) := 1 - p(y^*(x)|x). \]


Moreover, show that $y^*(x)$ is the label incurring the smallest
misclassification error.


Problem 1.14 (Nearest Neighbor Loss) Show that the expected loss in- 
curred by the nearest neighbor classifier does not exceed twice the loss of the 
Bayes optimal decision. 


2 


Density Estimation 


2.1 Limit Theorems 


Assume you are a gambler and go to a casino to play a game of dice. As
it happens, it is your unlucky day and among the 100 times you toss the
dice, you only see ’6’ eleven times. For a fair dice we know that each face
should occur with equal probability $\frac{1}{6}$. Hence the expected value
over 100 draws is $\frac{100}{6} \approx 17$, which is considerably more than
the eleven times that we observed. Before crying foul you decide that some
mathematical analysis is in order.

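The probability of seeing a particular sequence of $m$ trials out of which $n$
are a ’6’ is given by $\left(\frac{1}{6}\right)^n \left(\frac{5}{6}\right)^{m-n}$.
Moreover, there are $\binom{m}{n} = \frac{m!}{n!(m-n)!}$ different
sequences of ’6’ and ’not 6’ with proportions $n$ and $m-n$ respectively. Hence
we may compute the probability of seeing a ’6’ only 11 or less via


\[ \Pr(X \le 11) = \sum_{i=0}^{11} p(i) = \sum_{i=0}^{11} \binom{100}{i} \left[\frac{1}{6}\right]^{i} \left[\frac{5}{6}\right]^{100-i} \approx 7.8\%. \tag{2.1} \]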

After looking at this figure you decide that things are probably reasonable.
And, in fact, they are consistent with the convergence behavior of a
simulated dice in Figure 2.1. In computing (2.1) we have learned something
useful: the expansion is a special case of a binomial series. The first term
counts the number of configurations in which we could observe $i$ times ’6’ in
a sequence of 100 dice throws. The second and third term are the probabilities
of seeing one particular instance of such a sequence.


[Figure 2.1: six panels of empirical frequencies over the faces 1 to 6, for
m = 10, 20, 50, 100, 200 and 500 throws]

Fig. 2.1. Convergence of empirical means to expectations. From left to right:
empirical frequencies of occurrence obtained by casting a dice 10, 20, 50, 100,
200, and 500 times respectively. Note that after 20 throws we still have not
observed a single ’6’, an event which occurs with only
$\left[\frac{5}{6}\right]^{20} \approx 2.6\%$ probability.
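
The binomial sum in (2.1) is easy to evaluate numerically. The following
sketch uses only the Python standard library; the helper name `binomial_tail`
is an assumption made for illustration.

```python
from math import comb

def binomial_tail(m, k, p):
    """Probability of at most k successes in m independent trials."""
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k + 1))

# At most eleven times '6' in 100 throws of a fair dice.
prob = binomial_tail(100, 11, 1 / 6)
print(f"Pr(X <= 11) = {prob:.3f}")

# No '6' at all in 20 throws, as mentioned in the caption of Fig. 2.1.
print(f"Pr(no 6 in 20 throws) = {binomial_tail(20, 0, 1 / 6):.3f}")
```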

Note that in general we may not be as lucky, since we may have con- 
siderably less information about the setting we are studying. For instance, 
we might not know the actual probabilities for each face of the dice, which 
would be a likely assumption when gambling at a casino of questionable 
reputation. Often the outcomes of the system we are dealing with may be 
continuous valued random variables rather than binary ones, possibly even 
with unknown range. For instance, when trying to determine the average 
wage through a questionnaire we need to determine how many people we 
need to ask in order to obtain a certain level of confidence. 

To answer such questions we need to discuss limit theorems. They tell 
us by how much averages over a set of observations may deviate from the 
corresponding expectations and how many observations we need to draw to 
estimate a number of probabilities reliably. For completeness we will present 
proofs for some of the more fundamental theorems in Section 2.1.2. They 
are useful albeit non-essential for the understanding of the remainder of the 
book and may be omitted. 


2.1.1 Fundamental Laws 


The Law of Large Numbers developed by Bernoulli in 1713 is one of the 
fundamental building blocks of statistical analysis. It states that averages 
over a number of observations converge to their expectations given a suffi- 
ciently large number of observations and given certain assumptions on the 
independence of these observations. It comes in two flavors: the weak and 
the strong law. 


Theorem 2.1 (Weak Law of Large Numbers) Denote by $X_1, \ldots, X_m$
random variables drawn from $p(x)$ with mean $\mu = \mathbf{E}_{X_i}[x_i]$ for
all $i$. Moreover let


\[ \bar{X}_m := \frac{1}{m} \sum_{i=1}^{m} X_i \tag{2.2} \]


be the empirical average over the random variables $X_i$. Then for any
$\epsilon > 0$ the following holds


\[ \lim_{m \to \infty} \Pr\left(\left|\bar{X}_m - \mu\right| \le \epsilon\right) = 1. \tag{2.3} \]


Fig. 2.2. The mean of a number of casts of a dice. The horizontal straight line
denotes the mean 3.5. The uneven solid line denotes the actual mean $\bar{X}_n$
as a function of the number of draws, given as a semilogarithmic plot. The
crosses denote the outcomes of the dice. Note how $\bar{X}_n$ ever more closely
approaches the mean 3.5 as we obtain an increasing number of observations.


This establishes that, indeed, for large enough sample sizes, the average will 
converge to the expectation. The strong law strengthens this as follows: 


Theorem 2.2 (Strong Law of Large Numbers) Under the conditions
of Theorem 2.1 we have $\Pr\left(\lim_{m \to \infty} \bar{X}_m = \mu\right) = 1$.


The strong law implies that almost surely (in a measure theoretic sense)
$\bar{X}_m$ converges to $\mu$, whereas the weak law only states that for every
$\epsilon$ the probability that $\bar{X}_m$ lies within the interval
$[\mu - \epsilon, \mu + \epsilon]$ converges to 1. Clearly the strong implies
the weak law since the measure of the events $\bar{X}_m = \mu$ converges to 1,
hence any $\epsilon$-ball around $\mu$ would capture this.

Both laws justify that we may take sample averages, e.g. over a number 
of events such as the outcomes of a dice and use the latter to estimate their 
means, their probabilities (here we treat the indicator variable of the event 
as a {0; 1}-valued random variable), their variances or related quantities. We 
postpone a proof until Section 2.1.2, since an effective way of proving Theo- 
rem 2.1 relies on the theory of characteristic functions which we will discuss 
in the next section. For the moment, we only give a pictorial illustration in 
Figure 2.2. 
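
The behavior described by both laws is easy to observe in simulation. The
sketch below draws throws of a fair dice with NumPy and tracks the running
average; the seed and the number of throws are arbitrary choices made for
illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
throws = rng.integers(1, 7, size=10_000)   # outcomes of a simulated fair dice
running_mean = np.cumsum(throws) / np.arange(1, throws.size + 1)

# The running average should drift towards the expectation 3.5.
for m in (10, 100, 1_000, 10_000):
    print(f"m = {m:>6}: average = {running_mean[m - 1]:.3f}")
```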

Once we established that the random variable
$\bar{X}_m = m^{-1} \sum_{i=1}^{m} X_i$ converges to its mean $\mu$, a natural
second question is to establish how quickly it converges and what the
properties of the limiting distribution of $\bar{X}_m - \mu$ are. Note in
Figure 2.2 that the initial deviation from the mean is large whereas as we
observe more data the empirical mean approaches the true one.


[Figure 2.3: five running averages plotted against the number of tosses on a
logarithmic axis from $10^1$ to $10^3$]

Fig. 2.3. Five instantiations of a running average over outcomes of a toss of a
dice. Note that all of them converge to the mean 3.5. Moreover note that they
all are well contained within the upper and lower envelopes given by
$\mu \pm \sqrt{\mathrm{Var}_X[x]/m}$.


The central limit theorem answers this question exactly by addressing a 
slightly more general question, namely whether the sum over a number of 
independent random variables where each of them arises from a different 
distribution might also have a well behaved limiting distribution. This is 
the case as long as the variance of each of the random variables is bounded. 
The limiting distribution of such a sum is Gaussian. This affirms the pivotal 
role of the Gaussian distribution. 


Theorem 2.3 (Central Limit Theorem) Denote by $X_i$ independent random
variables with means $\mu_i$ and standard deviation $\sigma_i$. Then


\[ Z_m := \left[\sum_{i=1}^{m} \sigma_i^2\right]^{-\frac{1}{2}} \left[\sum_{i=1}^{m} (X_i - \mu_i)\right] \tag{2.4} \]


converges to a Normal Distribution with zero mean and unit variance.


Note that just like the law of large numbers the central limit theorem (CLT) 
is an asymptotic result. That is, only in the limit of an infinite number of 
observations will it become exact. That said, it often provides an excellent 
approximation even for finite numbers of observations, as illustrated in Fig- 
ure 2.4. In fact, the central limit theorem and related limit theorems build 
the foundation of what is known as asymptotic statistics. 
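
A small simulation illustrates how good the approximation already is for
moderate $m$. The sketch below forms $Z_m$ as in (2.4) for uniform variables
on $[-0.5, 0.5]$, which have mean 0 and variance $1/12$. The function name
`standardized_sum`, the choice $m = 16$ and the number of trials are
assumptions made for illustration.

```python
import numpy as np

def standardized_sum(samples, means, variances):
    """Z_m from (2.4): centered sum divided by the root of the summed variances."""
    return (samples - means).sum(axis=1) / np.sqrt(np.sum(variances))

rng = np.random.default_rng(0)
m, trials = 16, 100_000
# Uniform variables on [-0.5, 0.5] have mean 0 and variance 1/12.
samples = rng.uniform(-0.5, 0.5, size=(trials, m))
z = standardized_sum(samples, np.zeros(m), np.full(m, 1 / 12))

print(f"mean = {z.mean():.3f}, variance = {z.var():.3f}")
# For a standard normal variable this fraction is about 0.683.
print(f"fraction within one standard deviation = {np.mean(np.abs(z) <= 1):.3f}")
```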


Example 2.1 (Dice) If we are interested in computing the mean of the
values returned by a dice we may apply the CLT to the sum over $m$ variables
which have all mean $\mu = 3.5$ and variance (see Problem 2.1)

\[ \mathrm{Var}_X[x] = \mathbf{E}_X[x^2] - \mathbf{E}_X[x]^2 = (1 + 4 + 9 + 16 + 25 + 36)/6 - 3.5^2 \approx 2.92. \]


We now study the random variable $W_m := m^{-1} \sum_{i=1}^{m} [X_i - 3.5]$.
Since each of the terms in the sum has zero mean, also $W_m$'s mean vanishes.
Moreover, $W_m$ is a multiple of $Z_m$ of (2.4). Hence we have that $W_m$
converges to a normal distribution with zero mean and standard deviation
$\sqrt{2.92}\, m^{-\frac{1}{2}} \approx 1.71\, m^{-\frac{1}{2}}$.

Consequently the average of $m$ tosses of the dice yields a random variable
with mean 3.5 and it will approach a normal distribution with variance
$2.92\, m^{-1}$. In other words, the empirical mean converges to its average at
rate $O(m^{-\frac{1}{2}})$. Figure 2.3 gives an illustration of the quality of
the bounds implied by the CLT.
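
As a sanity check on the numbers in Example 2.1, the mean and variance of a
single throw can be computed exactly with rational arithmetic. The standard
deviation of the average over $m$ independent throws is then the square root
of the variance divided by $m$. The helper name `std_of_average` is
hypothetical.

```python
from fractions import Fraction
from math import sqrt

faces = range(1, 7)
mean = Fraction(sum(faces), 6)                                # 7/2
variance = Fraction(sum(x * x for x in faces), 6) - mean**2   # 35/12

def std_of_average(m):
    """Standard deviation of the average of m independent throws."""
    return sqrt(variance / m)

print(mean, variance, float(variance))
print(std_of_average(100))
```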


One remarkable property of functions of random variables is that in many 
conditions convergence properties of the random variables are bestowed upon 
the functions, too. This is manifest in the following two results: a variant 
of Slutsky’s theorem and the so-called delta method. The former deals with 
limit behavior whereas the latter deals with an extension of the central limit 
theorem. 


Theorem 2.4 (Slutsky’s Theorem) Denote by $X_i, Y_i$ sequences of random
variables with $X_i \to X$ and $Y_i \to c$ for $c \in \mathbb{R}$ in
probability. Moreover, denote by $g(x, y)$ a function which is continuous for
all $(x, c)$. In this case the random variable $g(X_i, Y_i)$ converges in
probability to $g(X, c)$.


For a proof see e.g. [Bil68]. Theorem 2.4 is often referred to as the continuous 
mapping theorem (Slutsky only proved the result for affine functions). It 
means that for functions of random variables it is possible to pull the limiting 
procedure into the function. Such a device is useful when trying to prove 
asymptotic normality and in order to obtain characterizations of the limiting 
distribution. 


Theorem 2.5 (Delta Method) Assume that $X_n \in \mathbb{R}^d$ is asymptotically
normal with $a_n^{-2}(X_n - b) \to \mathcal{N}(0, \Sigma)$ for $a_n^2 \to 0$.
Moreover, assume that $g : \mathbb{R}^d \to \mathbb{R}^l$ is a mapping which is
continuously differentiable at $b$. In this case the random variable $g(X_n)$
converges


\[ a_n^{-2} \left(g(X_n) - g(b)\right) \to \mathcal{N}\left(0, [\nabla_x g(b)] \Sigma [\nabla_x g(b)]^\top\right). \tag{2.5} \]


Proof Via a Taylor expansion we see that


\[ a_n^{-2} \left[g(X_n) - g(b)\right] = [\nabla_x g(\xi_n)]^\top a_n^{-2} (X_n - b). \tag{2.6} \]

Here $\xi_n$ lies on the line segment $[b, X_n]$. Since $X_n \to b$ we have that
$\xi_n \to b$, too. Since $g$ is continuously differentiable at $b$ we may apply
Slutsky’s theorem to see that
$a_n^{-2} [g(X_n) - g(b)] \to [\nabla_x g(b)]^\top a_n^{-2} (X_n - b)$. As a
consequence, the transformed random variable is asymptotically normal with
covariance $[\nabla_x g(b)] \Sigma [\nabla_x g(b)]^\top$. ∎


We will use the delta method when it comes to investigating properties of 
maximum likelihood estimators in exponential families. There g will play the 
role of a mapping between expectations and the natural parametrization of 
a distribution. 
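
To get a feeling for the delta method, it can be checked by simulation in a
one-dimensional case. The sketch below is an illustration under assumptions
that are not part of the text: $X_n$ is the average of $m$ throws of a fair
dice (so the scaling $a_n^{-2}$ is $\sqrt{m}$ and $\Sigma = 35/12$), and the
mapping is chosen as $g(x) = x^2$ with derivative $2\mu$ at the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 500, 4_000
mu, sigma2 = 3.5, 35 / 12            # mean and variance of one fair dice throw

def g(x):
    return x ** 2                    # illustrative choice of g; g'(mu) = 2 * mu

# Averages of m throws, repeated `trials` times.
averages = rng.integers(1, 7, size=(trials, m)).mean(axis=1)
scaled = np.sqrt(m) * (g(averages) - g(mu))

predicted = (2 * mu) ** 2 * sigma2   # variance predicted by (2.5)
print(f"empirical variance = {scaled.var():.1f}, delta method = {predicted:.1f}")
```

The empirical variance should be close to the predicted one, with a
discrepancy that shrinks as the number of trials grows.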


2.1.2 The Characteristic Function 


The Fourier transform plays a crucial role in many areas of mathematical 
analysis and engineering. This is equally true in statistics. For historic
reasons its application to distributions is called the characteristic function,
which we will discuss in this section. At its foundations lie standard tools 
from functional analysis and signal processing [Rud73, Pap62]. We begin by 
recalling the basic properties: 


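Definition 2.6 (Fourier Transform) Denote by $f : \mathbb{R}^d \to \mathbb{C}$ a
function defined on a $d$-dimensional Euclidean space. Moreover, let
$x, \omega \in \mathbb{R}^d$. Then the Fourier transform $F$ and its inverse
$F^{-1}$ are given by


\[ F[f](\omega) := (2\pi)^{-\frac{d}{2}} \int_{\mathbb{R}^d} f(x) \exp(-i \langle \omega, x \rangle) \, dx \tag{2.7} \]

\[ F^{-1}[g](x) := (2\pi)^{-\frac{d}{2}} \int_{\mathbb{R}^d} g(\omega) \exp(i \langle \omega, x \rangle) \, d\omega. \tag{2.8} \]


The key insight is that $F^{-1} \circ F = F \circ F^{-1} = \mathrm{Id}$. In other
words, $F$ and $F^{-1}$ are inverses to each other for all functions which are
$L_2$ integrable on $\mathbb{R}^d$, which includes probability distributions.
One of the key advantages of Fourier transforms is that derivatives and
convolutions on $f$ translate into multiplications. That is, writing
$f \circ g$ for the convolution of $f$ and $g$, we have
$F[f \circ g] = (2\pi)^{\frac{d}{2}} F[f] \cdot F[g]$. The same rule applies to
the inverse transform, i.e.
$F^{-1}[f \circ g] = (2\pi)^{\frac{d}{2}} F^{-1}[f] \cdot F^{-1}[g]$.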

The benefit for statistical analysis is that often problems are more easily 
expressed in the Fourier domain and it is easier to prove convergence results 
there. These results then carry over to the original domain. We will be 
exploiting this fact in the proof of the law of large numbers and the central 
limit theorem. Note that the definition of Fourier transforms can be extended 
to more general domains such as groups. See e.g. [BCR84] for further details.
We next introduce the notion of a characteristic function of a distribution.¹


Definition 2.7 (Characteristic Function) Denote by $p(x)$ a distribution
of a random variable $X \in \mathbb{R}^d$. Then the characteristic function
$\phi_X(\omega)$ with $\omega \in \mathbb{R}^d$ is given by


\[ \phi_X(\omega) := (2\pi)^{\frac{d}{2}} F^{-1}[p(x)] = \int \exp(i \langle \omega, x \rangle) \, dp(x). \tag{2.9} \]


In other words, $\phi_X(\omega)$ is the inverse Fourier transform applied to the
probability measure $p(x)$. Consequently $\phi_X(\omega)$ uniquely characterizes
$p(x)$ and moreover, $p(x)$ can be recovered from $\phi_X(\omega)$ via the
forward Fourier transform. One of the key utilities of characteristic
functions is that they allow us to deal in easy ways with sums of random
variables.


Theorem 2.8 (Sums of random variables and convolutions) Denote
by $X, Y \in \mathbb{R}$ two independent random variables. Moreover, denote by
$Z := X + Y$ the sum of both random variables. Then the distribution over $Z$
satisfies $p(z) = p(x) \circ p(y)$. Moreover, the characteristic function
yields:


\[ \phi_Z(\omega) = \phi_X(\omega) \phi_Y(\omega). \tag{2.10} \]


Proof $Z$ is given by $Z = X + Y$. Hence, for a given $Z = z$ we have
the freedom to choose $X = x$ freely provided that $Y = z - x$. In terms of
distributions this means that the joint distribution $p(z, x)$ is given by


\[ p(z, x) = p(Y = z - x) p(x) \]
\[ \text{and hence } p(z) = \int p(Y = z - x) \, dp(x) = [p(x) \circ p(y)](z). \]


The result for characteristic functions follows from the property of the
Fourier transform. ∎


For sums of several random variables the characteristic function is the prod- 
uct of the individual characteristic functions. This allows us to prove both 
the weak law of large numbers and the central limit theorem (see Figure 2.4 
for an illustration) by proving convergence in the Fourier domain. 
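
For discrete distributions both statements of Theorem 2.8 can be checked
directly. The sketch below uses NumPy to convolve the distribution of a fair
dice with itself, which gives the distribution of the sum of two throws, and
then compares the characteristic functions as in (2.10). The helper `char_fn`
and the test frequency are assumptions made for illustration.

```python
import numpy as np

values = np.arange(1, 7)
p = np.full(6, 1 / 6)                      # distribution of one fair dice

# Distribution of Z = X + Y on the values 2, ..., 12 via discrete convolution.
p_sum = np.convolve(p, p)
sum_values = np.arange(2, 13)

def char_fn(omega, support, probs):
    """Characteristic function E[exp(i * omega * x)] of a discrete distribution."""
    return np.sum(probs * np.exp(1j * omega * support))

omega = 0.7                                # arbitrary test frequency
lhs = char_fn(omega, sum_values, p_sum)
rhs = char_fn(omega, values, p) ** 2
print(p_sum[5], abs(lhs - rhs))            # Pr(Z = 7) and the mismatch in (2.10)
```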

Proof [Weak Law of Large Numbers] At the heart of our analysis lies
a Taylor expansion of the exponential into


\[ \exp(i \omega x) = 1 + i \langle \omega, x \rangle + o(|\omega|) \]
\[ \text{and hence } \phi_X(\omega) = 1 + i \omega \mathbf{E}_X[x] + o(|\omega|). \]

¹ In Chapter ?? we will discuss more general descriptions of distributions of which $\phi_X$ is a special
case. In particular, we will replace the exponential $\exp(i \langle \omega, x \rangle)$ by a kernel function $k(x, x')$.


[Figure 2.4: two rows of five density plots; top row on the horizontal range
-5 to 5, bottom row on the horizontal range -1 to 1]

Fig. 2.4. A working example of the central limit theorem. The top row contains
distributions of sums of uniformly distributed random variables on the interval
$[-0.5, 0.5]$. From left to right we have sums of 1, 2, 4, 8 and 16 random
variables. The bottom row contains the same distribution with the means
rescaled by $\sqrt{m}$, where $m$ is the number of observations. Note how the
distribution converges increasingly to the normal distribution.


Given $m$ random variables $X_i$ with mean $\mathbf{E}_X[x] = \mu$ this means
that their average $\bar{X}_m := \frac{1}{m} \sum_{i=1}^{m} X_i$ has the
characteristic function


\[ \phi_{\bar{X}_m}(\omega) = \left(1 + \frac{i}{m} \omega \mu + o(m^{-1} |\omega|)\right)^{m}. \tag{2.11} \]


In the limit of $m \to \infty$ this converges to $\exp(i \omega \mu)$, the
characteristic function of the constant distribution with mean $\mu$. This
proves the claim that in the large sample limit $\bar{X}_m$ is essentially
constant with mean $\mu$. ∎


Proof [Central Limit Theorem] We use the same idea as above to prove
the CLT. The main difference, though, is that we need to assume that the
second moments of the random variables $X_i$ exist. To avoid clutter we only
prove the case of constant mean $\mathbf{E}_{X_i}[x_i] = \mu$ and variance
$\mathrm{Var}_{X_i}[x_i] = \sigma^2$.
Let $Z_m := \frac{1}{\sqrt{m \sigma^2}} \sum_{i=1}^{m} (X_i - \mu)$. Our proof
relies on showing convergence of the characteristic function of $Z_m$, i.e.
$\phi_{Z_m}$, to that of a normally distributed random variable $W$ with zero
mean and unit variance. Expanding the exponential to second order yields:


\[ \exp(i \omega x) = 1 + i \omega x - \frac{1}{2} \omega^2 x^2 + o(|\omega|^2) \]
\[ \text{and hence } \phi_X(\omega) = 1 + i \omega \mathbf{E}_X[x] - \frac{1}{2} \omega^2 \mathrm{Var}_X[x] + o(|\omega|^2). \]


Since the mean of $Z_m$ vanishes by centering $(X_i - \mu)$ and the variance
per variable is $m^{-1}$ we may write the characteristic function of $Z_m$ via


\[ \phi_{Z_m}(\omega) = \left(1 - \frac{1}{2m} \omega^2 + o(m^{-1} |\omega|^2)\right)^{m}. \]


As before, taking limits $m \to \infty$ yields the exponential function. We
have that $\lim_{m \to \infty} \phi_{Z_m}(\omega) = \exp(-\frac{1}{2} \omega^2)$
which is the characteristic function of the normal distribution with zero mean
and variance 1. Since the characteristic function transform is injective this
proves our claim. ∎


Note that the characteristic function has a number of useful properties. For
instance, it can also be used as a moment generating function via the identity:


\[ \nabla_\omega^n \phi_X(0) = i^{n} \mathbf{E}_X[x^n]. \tag{2.12} \]


Its proof is left as an exercise. See Problem 2.2 for details. This connection 
also implies (subject to regularity conditions) that if we know the moments 
of a distribution we are able to reconstruct it directly since it allows us 
to reconstruct its characteristic function. This idea has been exploited in 
density estimation [Cra46] in the form of Edgeworth and Gram-Charlier 
expansions [Hal92]. 
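
The link between derivatives of the characteristic function at zero and the
moments can be checked numerically. The sketch below assumes the identity in
the form $\nabla_\omega^n \phi_X(0) = i^n \mathbf{E}_X[x^n]$ and approximates
the first two derivatives for a fair dice by central differences; the step
size `h` is an arbitrary choice.

```python
import numpy as np

values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

def phi(omega):
    """Characteristic function of a fair dice."""
    return np.sum(probs * np.exp(1j * omega * values))

h = 1e-3                                             # step size for central differences
first = (phi(h) - phi(-h)) / (2 * h)                 # approximates i * E[x]
second = (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2    # approximates i^2 * E[x^2]

mean = (first / 1j).real
second_moment = (second / 1j**2).real
print(mean, second_moment)                           # compare with 3.5 and 91/6
```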


2.1.3 Tail Bounds 


In practice we never have access to an infinite number of observations. Hence
the central limit theorem does not apply exactly but only provides an
approximation to the real situation. For instance, in the case of the dice, we
might want to state worst case bounds for finite sums of random variables to
determine by how much the empirical mean may deviate from its expectation.
Those bounds will not only be useful for simple averages but also to quantify
the behavior of more sophisticated estimators based on a set of observations.

The bounds we discuss below differ in the amount of knowledge they
assume about the random variables in question. For instance, we might only
know their mean. This leads to the Gauss-Markov inequality. If we know
their mean and their variance we are able to state a stronger bound, the
Chebyshev inequality. For an even stronger setting, when we know that
each variable has bounded range, we will be able to state a Chernoff bound.
Those bounds are progressively tighter and also more difficult to prove.
We state them in order of technical sophistication.


Theorem 2.9 (Gauss-Markov) Denote by $X \ge 0$ a random variable and
let $\mu$ be its mean. Then for any $\epsilon > 0$ we have


\[ \Pr(X \ge \epsilon) \le \frac{\mu}{\epsilon}. \tag{2.13} \]


Proof We use the fact that for nonnegative random variables


\[ \Pr(X \ge \epsilon) = \int_{\epsilon}^{\infty} dp(x) \le \int_{\epsilon}^{\infty} \frac{x}{\epsilon} \, dp(x) \le \epsilon^{-1} \int_{0}^{\infty} x \, dp(x) = \frac{\mu}{\epsilon}. \]


This means that for random variables with a small mean, the proportion of
samples with large value has to be small. ∎


Consequently deviations from the mean are $O(\epsilon^{-1})$. However, note that
this bound does not depend on the number of observations. A useful application
of the Gauss-Markov inequality is Chebyshev’s inequality. It is a statement
on the range of a random variable using its variance.


Theorem 2.10 (Chebyshev) Denote by $X$ a random variable with mean
$\mu$ and variance $\sigma^2$. Then the following holds for $\epsilon > 0$:


\[ \Pr(|x - \mu| \ge \epsilon) \le \frac{\sigma^2}{\epsilon^2}. \tag{2.14} \]


Proof Denote by $Y := |X - \mu|^2$ the random variable quantifying the
deviation of $X$ from its mean $\mu$. By construction we know that
$\mathbf{E}_Y[y] = \sigma^2$. Next let $\gamma := \epsilon^2$. Applying
Theorem 2.9 to $Y$ and $\gamma$ yields $\Pr(Y > \gamma) \le \sigma^2/\gamma$
which proves the claim. ∎


Note the improvement to the Gauss-Markov inequality. Where before we had
bounds whose confidence improved with $O(\epsilon^{-1})$ we can now state
$O(\epsilon^{-2})$ bounds for deviations from the mean.
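
Both inequalities can be compared against exact probabilities for a single
throw of a fair dice. The sketch below uses rational arithmetic from the
Python standard library; the thresholds 5 and 5/2 and the helper names are
arbitrary choices made for illustration.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)
mu = sum(p * x for x in faces)                       # 7/2
var = sum(p * (x - mu) ** 2 for x in faces)          # 35/12

def markov_bound(eps):
    return mu / eps                                  # right-hand side of (2.13)

def chebyshev_bound(eps):
    return var / eps**2                              # right-hand side of (2.14)

eps = Fraction(5, 2)
tail = sum(p for x in faces if x >= 5)                    # Pr(X >= 5)
deviation = sum(p for x in faces if abs(x - mu) >= eps)   # Pr(|X - mu| >= 5/2)

print(float(tail), float(markov_bound(5)))
print(float(deviation), float(chebyshev_bound(eps)))
```

In both cases the exact probability lies below the bound, and neither bound
is tight for this distribution.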


Example 2.2 (Chebyshev bound) Assume that
$\bar{X}_m := m^{-1} \sum_{i=1}^{m} X_i$ is the average over $m$ independent
random variables with mean $\mu$ and variance $\sigma^2$. Hence $\bar{X}_m$ also
has mean $\mu$. Its variance is given by


\[ \mathrm{Var}_{\bar{X}_m}[\bar{x}_m] = \sum_{i=1}^{m} m^{-2} \mathrm{Var}_{X_i}[x_i] = m^{-1} \sigma^2. \]

Applying Chebyshev’s inequality yields that the probability of a deviation
of $\epsilon$ from the mean is bounded by $\frac{\sigma^2}{m \epsilon^2}$. For
fixed failure probability $\delta = \Pr(|\bar{X}_m - \mu| > \epsilon)$ we have


\[ \delta \le \sigma^2 m^{-1} \epsilon^{-2} \quad \text{and equivalently} \quad \epsilon \le \sigma / \sqrt{m \delta}. \]


This bound is quite reasonable for large $\delta$ but it means that for high
levels of confidence we need a huge number of observations.
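
The dependence on the failure probability is easy to tabulate. The sketch
below evaluates $\sigma / \sqrt{m \delta}$ for the average of 100 throws of a
fair dice; the function names and the chosen values of $\delta$ are
assumptions made for illustration.

```python
from math import sqrt

def chebyshev_epsilon(sigma, m, delta):
    """Deviation eps for which Chebyshev bounds the failure probability by delta."""
    return sigma / sqrt(m * delta)

def chebyshev_sample_size(sigma, eps, delta):
    """Number of observations at which sigma^2 / (m * eps^2) equals delta."""
    return sigma**2 / (delta * eps**2)

sigma = sqrt(35 / 12)           # standard deviation of one fair dice throw
for delta in (0.5, 0.05, 0.005):
    print(delta, chebyshev_epsilon(sigma, m=100, delta=delta))
```

Each tenfold reduction of $\delta$ widens the interval by a factor of
$\sqrt{10}$, or equivalently requires ten times as many observations for the
same $\epsilon$.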


Much stronger results can be obtained if we are able to bound the range 
of the random variables. Using the latter, we reap an exponential improve- 
ment in the quality of the bounds in the form of the McDiarmid [McD89] 
inequality. We state the latter without proof: 


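Theorem 2.11 (McDiarmid) Denote by $f : X^m \to \mathbb{R}$ a function on $X$
and let $X_i$ be independent random variables. In this case the following holds:


\[ \Pr\left(\left|f(x_1, \ldots, x_m) - \mathbf{E}_{X_1, \ldots, X_m}[f(x_1, \ldots, x_m)]\right| > \epsilon\right) \le 2 \exp\left(-2 \epsilon^2 C^{-2}\right). \]


Here the constant $C^2$ is given by $C^2 = \sum_{i=1}^{m} c_i^2$ where


\[ \left|f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)\right| \le c_i \]
for all $x_1, \ldots, x_m, x_i'$ and for all $i$.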


This bound can be used for averages of a number of observations when 
they are computed according to some algorithm as long as the latter can be 
encoded in $f$. In particular, we have the following bound [Hoe63]:


Theorem 2.12 (Hoeffding) Denote by $X_i$ iid random variables with bounded
range $X_i \in [a, b]$ and mean $\mu$. Let
$\bar{X}_m := m^{-1} \sum_{i=1}^{m} X_i$ be their average.
Then the following bound holds:


\[ \Pr\left(\left|\bar{X}_m - \mu\right| > \epsilon\right) \le 2 \exp\left(-\frac{2 m \epsilon^2}{(b - a)^2}\right). \tag{2.15} \]


Proof This is a corollary of Theorem 2.11. In $\bar{X}_m$ each individual random
variable has range $[a/m, b/m]$ and we set $f(X_1, \ldots, X_m) := \bar{X}_m$.
Straightforward algebra shows that $C^2 = m^{-1} (b - a)^2$. Plugging this back
into McDiarmid’s theorem proves the claim. ∎


Note that (2.15) is exponentially better than the previous bounds. With
increasing sample size the confidence level also increases exponentially.


Example 2.3 (Hoeffding bound) As in Example 2.2 assume that $X_i$ are
iid random variables and let $\bar{X}_m$ be their average. Moreover, assume that
$X_i \in [a, b]$ for all $i$. As before we want to obtain guarantees on the
probability that $|\bar{X}_m - \mu| > \epsilon$. For a given level of
confidence $1 - \delta$ we need to solve

\[ \delta \le 2 \exp\left(-\frac{2 m \epsilon^2}{(b - a)^2}\right) \tag{2.16} \]

for $\epsilon$. Straightforward algebra shows that in this case $\epsilon$ needs
to satisfy


\[ \epsilon \ge |b - a| \sqrt{[\log 2 - \log \delta] / 2m}. \tag{2.17} \]


In other words, while the confidence level only enters logarithmically into the
inequality, the sample size $m$ improves our confidence only with
$\epsilon = O(m^{-\frac{1}{2}})$. That is, in order to improve our confidence
interval from $\epsilon = 0.1$ to $\epsilon = 0.01$ we need 100 times as many
observations.
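
The scaling in (2.17) can be made concrete with a few lines of Python. The
sketch below evaluates the right-hand side of (2.17) and also solves (2.16)
for $m$; the function names and the dice example (range $[1, 6]$, confidence
95%) are assumptions made for illustration.

```python
from math import log, sqrt

def hoeffding_epsilon(m, delta, a, b):
    """Right-hand side of (2.17) for m observations with range [a, b]."""
    return abs(b - a) * sqrt((log(2) - log(delta)) / (2 * m))

def hoeffding_sample_size(eps, delta, a, b):
    """Solve (2.16) with equality for m instead of eps."""
    return (b - a) ** 2 * log(2 / delta) / (2 * eps**2)

# Dice throws lie in [1, 6]; the confidence level is 95%.
eps_small = hoeffding_epsilon(m=1_000, delta=0.05, a=1, b=6)
eps_large = hoeffding_epsilon(m=100_000, delta=0.05, a=1, b=6)
print(eps_small, eps_large, eps_small / eps_large)
```

With 100 times as many observations the interval shrinks by a factor of 10,
in line with the $O(m^{-\frac{1}{2}})$ rate.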


While this bound is tight (see Problem 2.5 for details), it is possible to ob- 
tain better bounds if we know additional information. In particular knowing 
a bound on the variance of a random variable in addition to knowing that it 
has bounded range would allow us to strengthen the statement considerably. 
The Bernstein inequality captures this connection. For details see [BBL05]
or works on empirical process theory [vdVW96, SW86, Vap82]. 


2.1.4 An Example 


It is probably easiest to illustrate the various bounds using a concrete exam- 
ple. In a semiconductor fab processors are produced on a wafer. A typical 
300mm wafer holds about 400 chips. A large number of processing steps 
are required to produce a finished microprocessor and often it is impossible 
to assess the effect of a design decision until the finished product has been 
produced. 

Assume that the production manager wants to change some step from 
process ’A’ to some other process ’B’. The goal is to increase the yield of 
the process, that is, the number of chips of the 400 potential chips on the 
wafer which can be sold. Unfortunately this number is a random variable, 
i.e. the number of working chips per wafer can vary widely between different 
wafers. Since process ’A’ has been running in the factory for a very long 
time we may assume that the yield is well known, say it is $\mu_A = 350$ out
of 400 processors on average. It is our goal to determine whether process 
’B’ is better and what its yield may be. Obviously, since production runs 
are expensive we want to be able to determine this number as quickly as 
possible, i.e. using as few wafers as possible. The production manager is risk 
averse and wants to ensure that the new process is really better. Hence he 
requires a confidence level of 95% before he will change the production.
A first step is to formalize the problem. Since we know process ’A’ exactly
we only need to concern ourselves with ’B’. We associate the random variable
$X_i$ with wafer $i$. A reasonable (and somewhat simplifying) assumption is to
posit that all $X_i$ are independent and identically distributed where all
$X_i$ have the mean $\mu_B$. Obviously we do not know $\mu_B$, otherwise there
would be no reason for testing! We denote by $\bar{X}_m$ the average of the
yields of $m$ wafers using process ’B’. What we are interested in is the
accuracy $\epsilon$ for which the probability


\[ \delta = \Pr(|\bar{X}_m - \mu_B| > \epsilon) \text{ satisfies } \delta \le 0.05. \]


Let us now discuss how the various bounds behave. For the sake of the
argument assume that $\mu_B - \mu_A = 20$, i.e. the new process produces on
average 20 additional usable chips.


Chebyshev In order to apply the Chebyshev inequality we need to bound
the variance of the random variables $X_i$. The worst possible variance would
occur if $X_i \in \{0, 400\}$ where both events occur with equal probability. In
other words, with equal probability the wafer is fully usable or it is entirely
broken. This amounts to $\sigma^2 = 0.5(200 - 0)^2 + 0.5(200 - 400)^2 = 40{,}000$.
Since for Chebyshev bounds we have

$$\delta \le \sigma^2 m^{-1} \epsilon^{-2} \tag{2.18}$$

we can solve for $m = \sigma^2/(\delta\epsilon^2) = 40{,}000/(0.05 \cdot 400) = 2{,}000$. In other
words, we would typically need 2,000 wafers to assess with reasonable
confidence whether process ’B’ is better than process ’A’. This is completely
unrealistic.

Slightly better bounds can be obtained if we are able to make better
assumptions on the variance. For instance, if we can be sure that the yield
of process ’B’ is at least 300, then the largest possible variance is
$0.25(300 - 0)^2 + 0.75(300 - 400)^2 = 30{,}000$, leading to a minimum of 1,500
wafers, which is not much better.


Hoeffding Since the yields are in the interval $\{0, \ldots, 400\}$ we have an
explicit bound on the range of observations. Recall the inequality (2.16) which
bounds the failure probability $\delta = 0.05$ by an exponential term. Solving this
for $m$ yields

$$m \ge 0.5\,|b - a|^2 \epsilon^{-2} \log(2/\delta) = 737.8 \tag{2.19}$$

In other words, we need at least 738 wafers to determine whether process ’B’
is better. While this improves on the Chebyshev bound by nearly a factor of
three, it still seems wasteful and we would like to do better.

Central Limit Theorem The central limit theorem is an approximation.
This means that our reasoning is not exact any more. That said, for
large enough sample sizes, the approximation is good enough to use it for
practical predictions. Assume for the moment that we knew the variance $\sigma^2$
exactly. In this case we know that $\bar{X}_m$ is approximately normal with mean
$\mu_B$ and variance $m^{-1}\sigma^2$. We are interested in the interval
$[\mu - \epsilon, \mu + \epsilon]$ which contains 95% of the probability mass of a normal
distribution with mean $\mu$ and variance $\sigma^2$. That is, we need to solve the
integral

$$\frac{1}{\sqrt{2\pi\sigma^2}} \int_{\mu-\epsilon}^{\mu+\epsilon} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx = 0.95 \tag{2.20}$$



This can be solved efficiently using the cumulative distribution function of
a normal distribution (see Problem 2.3 for more details). One can check
that (2.20) is solved for $\epsilon = 1.96\sigma$. In other words, an interval of
$\pm 1.96\sigma$ contains 95% of the probability mass of a normal distribution.
Since $\bar{X}_m$ has standard deviation $\sigma/\sqrt{m}$, the number of observations
is therefore determined by

$$\epsilon = 1.96\,\sigma/\sqrt{m} \text{ and hence } m = 3.84\,\frac{\sigma^2}{\epsilon^2} \tag{2.21}$$

Again, our problem is that we do not know the variance of the distribution.
Using the worst-case bound on the variance, i.e. $\sigma^2 = 40{,}000$, would lead to
a requirement of at least $m = 385$ wafers for testing. However, while we do
not know the variance, we may estimate it along with the mean and use the
empirical estimate, possibly plus some small constant to ensure we do not
underestimate the variance, instead of the upper bound.

Assuming that fluctuations turn out to be in the order of 50 processors,
i.e. $\sigma^2 = 2500$, we are able to reduce our requirement to approximately 25
wafers. This is probably an acceptable number for a practical test.
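
The three sample-size requirements can be evaluated directly. The following is a minimal sketch, assuming $\epsilon = 20$ and $\delta = 0.05$ as in the wafer example; the function names are hypothetical, the Hoeffding constant follows (2.19) as printed, and the normal quantile is taken from Python's `statistics.NormalDist` rather than hard-coded.

```python
import math
from statistics import NormalDist


def chebyshev_m(var, eps, delta):
    """Smallest m for which var / (m * eps**2) <= delta, cf. (2.18)."""
    return var / (delta * eps ** 2)


def hoeffding_m(value_range, eps, delta):
    """Sample size from the Hoeffding-type bound in the form of (2.19)."""
    return 0.5 * value_range ** 2 / eps ** 2 * math.log(2 / delta)


def clt_m(var, eps, delta):
    """Sample size under the normal approximation, cf. (2.21)."""
    z = NormalDist().inv_cdf(1 - delta / 2)  # two-sided normal quantile
    return z ** 2 * var / eps ** 2


eps, delta = 20.0, 0.05
for name, m in [
    ("Chebyshev, worst-case variance", chebyshev_m(40_000, eps, delta)),
    ("Hoeffding, range 400", hoeffding_m(400, eps, delta)),
    ("CLT, worst-case variance", clt_m(40_000, eps, delta)),
    ("CLT, variance 2500", clt_m(2_500, eps, delta)),
]:
    print(f"{name}: m >= {math.ceil(m)}")
```

All three functions scale as $\epsilon^{-2}$; only the constants in front differ.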


Rates and Constants The astute reader will have noticed that all three 
confidence bounds had scaling behavior $m = O(\epsilon^{-2})$. That is, in all cases
the number of observations was a fairly ill behaved function of the amount 
of confidence required. If we were just interested in convergence per se, a 
statement like that of the Chebyshev inequality would have been entirely 
sufficient. The various laws and bounds can often be used to obtain con- 
siderably better constants for statistical confidence guarantees. For more 
complex estimators, such as methods to classify, rank, or annotate data, 
a reasoning such as the one above can become highly nontrivial. See e.g. 
[MYA94, Vap98] for further details. 

2.2 Parzen Windows

2.2.1 Discrete Density Estimation 


The convergence theorems discussed so far mean that we can use empir- 
ical observations for the purpose of density estimation. Recall the case of 
the Naive Bayes classifier of Section 1.3.1. One of the key ingredients was 
the ability to use information about word counts for different document 
classes to estimate the probability $p(w^j|y)$, where $w^j$ denoted the number
of occurrences of word $j$ in document $x$, given that it was labeled $y$. In the
following we discuss an extremely simple and crude method for estimating
probabilities. It relies on the fact that for random variables $X_i$ drawn from
distribution $p(x)$ with discrete values $X_i \in \mathcal{X}$ we have

$$\lim_{m \to \infty} \hat{p}_X(x) = p(x) \tag{2.22}$$

$$\text{where } \hat{p}_X(x) := m^{-1} \sum_{i=1}^{m} \{x_i = x\} \text{ for all } x \in \mathcal{X}. \tag{2.23}$$


Let us discuss a concrete case. We assume that we have 12 documents and 
would like to estimate the probability of occurrence of the word ’dog’ from 
it. As raw data we have: 


Document ID            1  2  3  4  5  6  7  8  9  10  11  12
Occurrences of ‘dog’   1  0  2  0  4  6  3  0  6   2   0   1

This means that the word ‘dog’ occurs the following number of times:

Occurrences of ‘dog’   0  1  2  3  4  5  6
Number of documents    4  2  2  1  1  0  2


Something unusual is happening here: for some reason we never observed
5 instances of the word dog in our documents, only 4 or fewer, or
alternatively 6 times. So what about 5 times? It is reasonable to assume that
the corresponding value should not be 0 either. Maybe we did not sample
enough. One possible strategy is to add pseudo-counts to the observations.
This amounts to the following estimate:

$$\hat{p}_X(x) := (m + |\mathcal{X}|)^{-1} \left[1 + \sum_{i=1}^{m} \{x_i = x\}\right] \approx p(x) \tag{2.24}$$

Clearly the limit for $m \to \infty$ is still $p(x)$. Hence, asymptotically we do not
lose anything. This prescription is what we used in Algorithm 1.1, a
method called Laplace smoothing. Below we contrast the two methods:

Occurrences of ‘dog’      0     1     2     3      4      5     6
Number of documents       4     2     2     1      1      0     2
Frequency of occurrence   0.33  0.17  0.17  0.083  0.083  0     0.17
Laplace smoothing         0.26  0.16  0.16  0.11   0.11   0.05  0.16
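
The two rows of estimates can be reproduced with a few lines of code. This is a minimal sketch of (2.23) and (2.24); the function names are hypothetical, and the domain is assumed to be the counts 0 through 6 that appear in the table.

```python
from collections import Counter

# Occurrences of 'dog' in the 12 documents from the table above.
counts = [1, 0, 2, 0, 4, 6, 3, 0, 6, 2, 0, 1]
domain = range(7)  # assumed domain: 0..6 occurrences


def empirical(observations, domain):
    """Relative frequencies, as in (2.23)."""
    tally = Counter(observations)
    m = len(observations)
    return {x: tally[x] / m for x in domain}


def laplace(observations, domain):
    """One pseudo-count per domain element, as in (2.24)."""
    tally = Counter(observations)
    m, size = len(observations), len(domain)
    return {x: (1 + tally[x]) / (m + size) for x in domain}


freq = empirical(counts, domain)
smooth = laplace(counts, domain)
for x in domain:
    print(f"{x}: frequency {freq[x]:.3f}, Laplace {smooth[x]:.3f}")
```

Both estimates sum to one over the domain; only the smoothed one assigns nonzero mass to the unobserved count 5.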


The problem with this method is that as $|\mathcal{X}|$ increases we need increasingly
more observations to obtain even a modicum of precision. On average, we
will need at least one observation for every $x \in \mathcal{X}$. This can be infeasible for
large domains as the following example shows. 


Example 2.4 (Curse of Dimensionality) Assume that $\mathcal{X} = \{0,1\}^d$, i.e.
$x$ consists of binary bit vectors of dimensionality $d$. As $d$ increases the size of
$\mathcal{X}$ increases exponentially, requiring an exponential number of observations
to perform density estimation. For instance, if we work with images, a
100 × 100 black and white picture would require in the order of $10^{3010}$
observations to model such fairly low-resolution images accurately. This is
clearly utterly infeasible: the number of particles in the known universe is
in the order of $10^{80}$. Bellman [Bel61] was one of the first to formalize this
dilemma by coining the term ’curse of dimensionality’.


This example clearly shows that we need better tools to deal with
high-dimensional data. We will present one such tool in the next section.


2.2.2 Smoothing Kernel 


We now proceed to proper density estimation. Assume that we want to
estimate the distribution of weights of a population. Sample data from a
population might look as follows: $X$ = {57, 88, 54, 84, 83, 59, 56, 43, 70, 63,
90, 98, 102, 97, 106, 99, 103, 112}. We could use this to perform a density
estimate by placing discrete components at the locations $x_i \in X$ with weight
$1/|X|$, as is done in Figure 2.5. There is no reason to believe that weights
are quantized in kilograms, or grams, or milligrams (or pounds and stones).
And even if they were, we would expect that similar weights would have similar
densities associated with them. Indeed, as the right diagram of Figure 2.5
shows, the corresponding density is continuous.

The key question arising is how we may transform $X$ into a realistic
estimate of the density $p(x)$. Starting with a ’density estimate’ with only
discrete terms

$$\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta(x - x_i) \tag{2.25}$$

we may choose to smooth it out by a smoothing kernel $h(x)$ such that the
probability mass becomes somewhat more spread out. For a density estimate
on $\mathcal{X} \subseteq \mathbb{R}^d$ this is achieved by

$$\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} r^{-d} h\left(\frac{x - x_i}{r}\right). \tag{2.26}$$

This expansion is commonly known as the Parzen windows estimate. Note
that obviously $h$ must be chosen such that $h(x) \ge 0$ for all $x \in \mathcal{X}$ and
moreover that $\int h(x)\,dx = 1$ in order to ensure that (2.26) is a proper
probability distribution. We now formally justify this smoothing. Let $R$ be a
small region such that

$$q = \int_R p(x)\,dx.$$


Out of the $m$ samples drawn from $p(x)$, the probability that $k$ of them fall
in region $R$ is given by the binomial distribution

$$\binom{m}{k} q^k (1 - q)^{m-k}.$$

The expected fraction of points falling inside the region can easily be
computed from the expected value of the binomial distribution: $E[k/m] = q$.
Similarly, the variance can be computed as $\mathrm{Var}[k/m] = q(1 - q)/m$. As
$m \to \infty$ the variance goes to 0 and hence the estimate peaks around the
expectation. We can therefore set

$$k \approx mq.$$

If we assume that $R$ is so small that $p(x)$ is constant over $R$, then

$$q \approx p(x) \cdot V,$$

where $V$ is the volume of $R$. Rearranging we obtain

$$p(x) \approx \frac{k}{mV}. \tag{2.27}$$


Let us now set $R$ to be a cube with side length $r$, and define a function

$$h(u) = \begin{cases} 1 & \text{if } |u_i| \le \frac{1}{2} \text{ for all } i \\ 0 & \text{otherwise.} \end{cases}$$

Observe that $h\left(\frac{x - x_i}{r}\right)$ is 1 if and only if $x_i$ lies inside a cube of size $r$
centered around $x$. If we let

$$k = \sum_{i=1}^{m} h\left(\frac{x - x_i}{r}\right),$$

then one can use (2.27) to estimate $p$ via

$$\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} r^{-d} h\left(\frac{x - x_i}{r}\right),$$

where $r^d$ is the volume of the hypercube of size $r$ in $d$ dimensions. By
symmetry, we can interpret this equation as the sum over $m$ cubes centered
around the $m$ data points $x_i$. If we replace the cube by any smooth kernel
function $h(\cdot)$ this recovers (2.26).

There exists a large variety of different kernels which can be used for the
kernel density estimate. [Sil86] has a detailed description of the properties
of a number of kernels. Popular choices are

$$h(x) = (2\pi)^{-\frac{1}{2}} e^{-\frac{1}{2}x^2} \quad \text{Gaussian kernel} \tag{2.28}$$
$$h(x) = \tfrac{1}{2} e^{-|x|} \quad \text{Laplace kernel} \tag{2.29}$$
$$h(x) = \tfrac{3}{4} \max(0, 1 - x^2) \quad \text{Epanechnikov kernel} \tag{2.30}$$
$$h(x) = \tfrac{1}{2} \chi_{[-1,1]}(x) \quad \text{Uniform kernel} \tag{2.31}$$
$$h(x) = \max(0, 1 - |x|) \quad \text{Triangle kernel.} \tag{2.32}$$

Further kernels are the triweight and the quartic kernel which are basically
powers of the Epanechnikov kernel. For practical purposes the Gaussian
kernel (2.28) or the Epanechnikov kernel (2.30) are most suitable. In particular,
the latter has the attractive property of compact support. This means that
for any given density estimate at location $x$ we will only need to evaluate
terms $h(x_i - x)$ for which the distance $\|x_i - x\|$ is less than $r$. Such
expansions are computationally much cheaper, in particular when we make use of
fast nearest neighbor search algorithms [GIM99, IM98]. Figure 2.7 has some
examples of kernels.
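
To make the estimate (2.26) concrete, the following minimal sketch evaluates it in one dimension ($d = 1$) on the 18 weights listed earlier, once with the Gaussian and once with the Epanechnikov kernel. The width $r = 5$ and the helper names are assumptions made for illustration; the midpoint-rule integrator is only there to check that the estimate integrates to one.

```python
import math

weights = [57, 88, 54, 84, 83, 59, 56, 43, 70, 63,
           90, 98, 102, 97, 106, 99, 103, 112]


def gaussian(u):
    """Gaussian kernel (2.28)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def epanechnikov(u):
    """Epanechnikov kernel (2.30); zero outside [-1, 1]."""
    return 0.75 * max(0.0, 1.0 - u * u)


def parzen(x, data, r, kernel):
    """One-dimensional Parzen windows estimate, following (2.26)."""
    return sum(kernel((x - xi) / r) for xi in data) / (len(data) * r)


def integrate(f, lo, hi, steps=20_000):
    """Midpoint rule, used only to check normalization."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))


for kernel in (gaussian, epanechnikov):
    def density(x, kernel=kernel):
        return parzen(x, weights, 5.0, kernel)
    print(kernel.__name__, density(100.0), integrate(density, 0.0, 160.0))
```

With the compactly supported kernel, only samples within distance $r$ of the query point contribute to the sum, which is what makes neighbor search structures useful here.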


2.2.3 Parameter Estimation 


So far we have not discussed the issue of parameter selection. It should be 
evident from Figure 2.6, though, that it is quite crucial to choose a good 
kernel width. Clearly, a kernel that is overly wide will oversmooth any fine 
detail that there might be in the density. On the other hand, a very narrow 
kernel will not be very useful, since it will be able to make statements only 
about the locations where we actually observed data. 

Fig. 2.5. Left: a naive density estimate given a sample of the weight of 18 persons.
Right: the underlying weight distribution.

Fig. 2.6. Parzen windows density estimate associated with the 18 observations of
the figure above. From left to right: Gaussian kernel density estimate with kernel
of width 0.3, 1, 3, and 10 respectively.

Fig. 2.7. Some kernels for Parzen windows density estimation. From left to right:
Gaussian kernel, Laplace kernel, Epanechnikov kernel, and uniform density.


Moreover, there is the issue of choosing a suitable kernel function. The 
fact that a large variety of them exists might suggest that this is a crucial 
issue. In practice, this turns out not to be the case and instead, the choice 
of a suitable kernel width is much more vital for good estimates. In other 
words, size matters, shape is secondary. 

The problem is that we do not know which kernel width is best for the 
data. If the problem is one-dimensional, we might hope to be able to eyeball 
the size of r. Obviously, in higher dimensions this approach fails. A second 
option would be to choose $r$ such that the log-likelihood of the data is
maximized. It is given by

$$\log \prod_{i=1}^{m} p(x_i) = -m \log m + \sum_{i=1}^{m} \log \sum_{j=1}^{m} r^{-d} h\left(\frac{x_i - x_j}{r}\right) \tag{2.33}$$


Remark 2.13 (Log-likelihood) We consider the logarithm of the likelihood
for reasons of computational stability to prevent numerical underflow.
While each term $p(x_i)$ might be within a suitable range, say $10^{-7}$, the
product of 1000 of such terms will easily exceed the exponent of floating point
representations on a computer. Summing over the logarithm, on the other
hand, is perfectly feasible even for large numbers of observations.
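
The underflow described in the remark is easy to observe. The sketch below uses 1000 made-up per-observation likelihoods of $10^{-7}$ each (hypothetical values chosen only for illustration) and compares the running product with the sum of logarithms in double precision.

```python
import math

probabilities = [1e-7] * 1000  # hypothetical per-observation likelihoods

product = 1.0
for p in probabilities:
    product *= p  # drops below the smallest positive double and becomes 0.0

log_likelihood = sum(math.log(p) for p in probabilities)

print(product)         # the product is no longer representable
print(log_likelihood)  # the sum of logarithms stays in a comfortable range
```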


Unfortunately, maximizing the log-likelihood directly is equally infeasible:
for decreasing $r$ the only surviving terms in (2.33) are the functions
$h((x_i - x_i)/r) = h(0)$, since the arguments of all other kernel functions
diverge. In other words, the log-likelihood is maximized when $p(x)$ is peaked
exactly at the locations where we observed the data. The graph on the left of
Figure 2.6 shows what happens in such a situation.

What we just experienced is a case of overfitting where our model is too 
flexible. This led to a situation where our model was able to explain the 
observed data “unreasonably well”, simply because we were able to adjust 
our parameters given the data. We will encounter this situation throughout 
the book. There exist a number of ways to address this problem. 


Validation Set: We could use a subset of our set of observations as an
estimate of the log-likelihood. That is, we could partition the
observations into $X := \{x_1, \ldots, x_n\}$ and $X' := \{x_{n+1}, \ldots, x_m\}$ and use
the second part for a likelihood score according to (2.33). The second
set is typically called a validation set.

n-fold Cross-validation: Taking this idea further, note that there is no
particular reason why any given $x_i$ should belong to $X$ or $X'$
respectively. In fact, we could use all splits of the observations into sets
$X$ and $X'$ to infer the quality of our estimate. While this is
computationally infeasible, we could decide to split the observations into
$n$ equally sized subsets, say $X_1, \ldots, X_n$, and use each of them as a
validation set at a time while the remainder is used to generate a
density estimate.

Typically $n$ is chosen to be 10, in which case this procedure is
referred to as 10-fold cross-validation. It is a computationally
attractive procedure insofar as it does not require us to change the
basic estimation algorithm. Nonetheless, computation can be costly.

Leave-one-out Estimator: At the extreme end of cross-validation we could
choose $n = m$. That is, we only remove a single observation at a time
and use the remainder of the data for the estimate. Using the average
over the likelihood scores provides us with an even more fine-grained
estimate. Denote by $p_i(x)$ the density estimate obtained by using
$X := \{x_1, \ldots, x_m\}$ without $x_i$. For a Parzen windows estimate this
is given by

$$p_i(x_i) = (m - 1)^{-1} \sum_{j \ne i} r^{-d} h\left(\frac{x_i - x_j}{r}\right) = \frac{m}{m-1} \left[p(x_i) - m^{-1} r^{-d} h(0)\right] \tag{2.34}$$

Note that this is precisely the term $r^{-d} h(0)$ that is removed from
the estimate. It is this term which led to divergent estimates for
$r \to 0$. This means that the leave-one-out log-likelihood estimate
can be computed easily via

$$L(X) = m \log \frac{m}{m-1} + \sum_{i=1}^{m} \log \left[p(x_i) - m^{-1} r^{-d} h(0)\right]. \tag{2.35}$$

We then choose $r$ such that $L(X)$ is maximized. This strategy is very
robust and whenever it can be implemented in a computationally
efficient manner, it is very reliable in performing model selection.
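
The contrast between the plain log-likelihood (2.33) and the leave-one-out score can be seen on the 18 weights used earlier. This is a minimal one-dimensional sketch with a Gaussian kernel; the candidate widths and function names are arbitrary assumptions. The leave-one-out sum is formed directly over $j \ne i$ rather than by the subtraction in (2.35), which is algebraically the same but avoids cancellation when $r$ is small.

```python
import math

weights = [57, 88, 54, 84, 83, 59, 56, 43, 70, 63,
           90, 98, 102, 97, 106, 99, 103, 112]


def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def log_likelihood(data, r):
    """Log-likelihood (2.33) of the data under its own estimate (d = 1)."""
    m = len(data)
    return sum(
        math.log(sum(gaussian((xi - xj) / r) for xj in data) / (m * r))
        for xi in data
    )


def loo_log_likelihood(data, r):
    """Leave-one-out score: each x_i is scored without its own kernel."""
    m = len(data)
    total = 0.0
    for i, xi in enumerate(data):
        others = sum(gaussian((xi - xj) / r)
                     for j, xj in enumerate(data) if j != i)
        total += math.log(others / ((m - 1) * r))
    return total


grid = [0.5, 1, 2, 4, 8, 16, 32]  # candidate widths, chosen arbitrarily
best_naive = max(grid, key=lambda r: log_likelihood(weights, r))
best_loo = max(grid, key=lambda r: loo_log_likelihood(weights, r))
print("naive choice:", best_naive, "leave-one-out choice:", best_loo)
```

On this grid the plain log-likelihood favors the narrowest width, whereas the leave-one-out score penalizes it heavily because isolated points receive almost no mass from their neighbors.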


An alternative, probably more of theoretical interest, is to choose the scale $r$
a priori based on the amount of data we have at our disposition. Intuitively,
we need a scheme which ensures that $r \to 0$ as the number of observations
increases, i.e. as $m \to \infty$. However, we need to ensure that this happens slowly
enough that the number of observations within range r keeps on increasing in 
order to ensure good statistical performance. For details we refer the reader 
to [Sil86]. Chapter ?? discusses issues of model selection for estimators in 
general in considerably more detail. 


2.2.4 Silverman’s Rule 


Assume you are an aspiring demographer who wishes to estimate the
population density of a country, say Australia. You might have access to a limited
census which, for a random portion of the population, determines where they
live. As a consequence you will obtain a relatively high number of samples
of city dwellers, whereas the number of people living in the countryside is
likely to be very small.

Fig. 2.8. Nonuniform density. Left: original density with samples drawn from the
distribution. Middle: density estimate with a uniform kernel. Right: density estimate
using Silverman’s adjustment.

If we attempt to perform density estimation using Parzen windows, we 
will encounter an interesting dilemma: in regions of high density (i.e. the 
cities) we will want to choose a narrow kernel width to allow us to model 
the variations in population density accurately. Conversely, in the outback, 
a very wide kernel is preferable, since the population there is very low. 
Unfortunately, this information is exactly what a density estimator itself 
could tell us. In other words we have a chicken and egg situation where 
having a good density estimate seems to be necessary to come up with a 
good density estimate. 

Fortunately this situation can be addressed by realizing that we do not 
actually need to know the density but rather a rough estimate of the latter. 
This can be obtained by using information about the average distance of the 
$k$ nearest neighbors of a point. One of Silverman’s rules of thumb [Sil86] is
to choose $r_i$ as

$$r_i = \frac{c}{k} \sum_{x \in \mathrm{kNN}(x_i)} \|x - x_i\|. \tag{2.36}$$

Typically $c$ is chosen to be 0.5 and $k$ is small, e.g. $k = 9$, to ensure that the
estimate is computationally efficient. The density estimate is then given by

$$p(x) = \frac{1}{m} \sum_{i=1}^{m} r_i^{-d} h\left(\frac{x - x_i}{r_i}\right). \tag{2.37}$$

Figure 2.8 shows an example of such a density estimate. It is clear that a
locality dependent kernel width is better than choosing a single constant
kernel width. However, note that this increases the computational
complexity of performing a density estimate, since first the $k$ nearest
neighbors need to be found before the density estimate can be carried out.
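
A minimal one-dimensional sketch of (2.36) and (2.37) follows, again on the 18 weights used earlier and with a Gaussian kernel. The function names are hypothetical, and the neighbor search is a brute-force sort, which is adequate for a handful of points but is exactly the extra cost mentioned above.

```python
import math

weights = [57, 88, 54, 84, 83, 59, 56, 43, 70, 63,
           90, 98, 102, 97, 106, 99, 103, 112]


def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def adaptive_widths(data, k=9, c=0.5):
    """r_i = c/k times the summed distance from x_i to its k nearest neighbors."""
    widths = []
    for i, xi in enumerate(data):
        dists = sorted(abs(xi - xj) for j, xj in enumerate(data) if j != i)
        widths.append(c / k * sum(dists[:k]))
    return widths


def adaptive_parzen(x, data, widths):
    """Estimate (2.37) for d = 1: every sample carries its own width."""
    return sum(gaussian((x - xi) / ri) / ri
               for xi, ri in zip(data, widths)) / len(data)


widths = adaptive_widths(weights)
print(min(widths), max(widths))
print(adaptive_parzen(100.0, weights, widths))
```

Samples in sparse regions, such as the lightest person in the data, receive wide kernels, while samples in the dense cluster around 100 receive narrow ones.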

2.2.5 Watson-Nadaraya Estimator


Now that we are able to perform density estimation we may use it to perform 
classification and regression. This leads us to an effective method for non- 
parametric data analysis, the Watson-Nadaraya estimator [Wat64, Nad65]. 

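The basic idea is very simple: assume that we have a binary classification
problem, i.e. we need to distinguish between two classes. Provided that we
are able to compute density estimates $p(x)$ given a set of observations $X$ we
could appeal to Bayes rule to obtain

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)} = \frac{\frac{m_y}{m} \cdot \frac{1}{m_y} \sum_{i: y_i = y} r^{-d} h\left(\frac{x_i - x}{r}\right)}{\frac{1}{m} \sum_{i=1}^{m} r^{-d} h\left(\frac{x_i - x}{r}\right)}. \tag{2.38}$$

Here $m_y$ denotes the number of observations with label $y$, and we only take
the sum over all $x_i$ with label $y_i = y$ in the numerator.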


The advantage of this approach is that it is very cheap to design such an 
estimator. After all, we only need to compute sums. The downside, similar 
to that of the k-nearest neighbor classifier is that it may require sums (or 
search) over a large number of observations. That is, evaluation of (2.38) is 
potentially an O(m) operation. Fast tree based representations can be used 
to accelerate this [BKL06, KM00], however their behavior depends
significantly on the dimensionality of the data. We will encounter computationally
more attractive methods at a later stage. 

For binary classification (2.38) can be simplified considerably. Assume
that $y \in \{\pm 1\}$. For $p(y = 1|x) > 0.5$ we estimate $y = 1$ and in the
converse case we estimate $y = -1$. Taking the difference between twice the
numerator and the denominator we can see that the function

$$f(x) = \frac{\sum_i y_i h\left(\frac{x_i - x}{r}\right)}{\sum_i h\left(\frac{x_i - x}{r}\right)} = \sum_i y_i \frac{h\left(\frac{x_i - x}{r}\right)}{\sum_j h\left(\frac{x_j - x}{r}\right)} =: \sum_i y_i w_i(x) \tag{2.39}$$

can be used to achieve the same goal since $f(x) > 0 \iff p(y = 1|x) > 0.5$.
Note that $f(x)$ is a weighted combination of the labels $y_i$ associated with
weights $w_i(x)$ which depend on the proximity of $x$ to an observation $x_i$.
In other words, (2.39) is a smoothed-out version of the k-nearest neighbor
classifier of Section 1.3.2. Instead of drawing a hard boundary at the $k$ closest
observations we use a soft weighting scheme with weights $w_i(x)$ depending
on which observations are closest.

Note furthermore that the numerator of (2.39) is very similar to the simple
classifier of Section 1.3.3. In fact, for kernels $k(x, x')$ such as the Gaussian
RBF kernel, which are also kernels in the sense of a Parzen windows density
estimate, i.e. $k(x, x') = r^{-d} h\left(\frac{x - x'}{r}\right)$, the two terms are identical. This
means that the Watson-Nadaraya estimator provides us with an alternative
explanation as to why (1.24) leads to a usable classifier.

Fig. 2.9. Watson-Nadaraya estimate. Left: a binary classifier. The optimal solution
would be a straight line since both classes were drawn from a normal distribution
with the same variance. Right: a regression estimator. The data was generated from
a sinusoid with additive noise. The regression tracks the sinusoid reasonably well.

In the same fashion as the Watson-Nadaraya classifier extends the k-nearest
neighbor classifier we also may construct a Watson-Nadaraya regression
estimator by replacing the binary labels $y_i$ by real-valued values
$y_i \in \mathbb{R}$ to obtain the regression estimator $\sum_i y_i w_i(x)$. Figure 2.9 has an
example of the workings of both a regression estimator and a classifier. They
are easy to use and they work well for moderately dimensional data.
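
Both uses of the estimator reduce to the same weighted average of labels. The following minimal sketch implements $\sum_i y_i w_i(x)$ from (2.39) in one dimension with a Gaussian kernel; the toy data, the widths, and the function names are assumptions made for illustration.

```python
import math


def gaussian(u):
    # The normalization constant cancels in the weights w_i(x).
    return math.exp(-0.5 * u * u)


def watson_nadaraya(x, xs, ys, r):
    """Weighted label average sum_i y_i w_i(x), as in (2.39)."""
    kernel_values = [gaussian((xi - x) / r) for xi in xs]
    total = sum(kernel_values)  # may underflow to 0 far away from all data
    return sum(y * h for y, h in zip(ys, kernel_values)) / total


# Classification: labels in {-1, +1}; the predicted class is the sign of f(x).
xs = [0.0, 0.5, 1.0, 3.0, 3.5, 4.0]
labels = [-1, -1, -1, 1, 1, 1]
print(watson_nadaraya(0.8, xs, labels, r=0.5))
print(watson_nadaraya(3.2, xs, labels, r=0.5))

# Regression: real-valued targets, here noise-free samples of a sinusoid.
grid = [0.1 * i for i in range(64)]
targets = [math.sin(x) for x in grid]
print(watson_nadaraya(1.55, grid, targets, r=0.2), math.sin(1.55))
```

Each query touches every observation, which is the O(m) evaluation cost noted above.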


2.3 Exponential Families 


Distributions from the exponential family are some of the most versatile 
tools for statistical inference. Gaussians, Poisson, Gamma and Wishart dis- 
tributions all form part of the exponential family. They play a key role in 
dealing with graphical models, classification, regression and conditional ran- 
dom fields which we will encounter in later parts of this book. Some of the 
reasons for their popularity are that they lead to convex optimization prob- 
lems and that they allow us to describe probability distributions by linear 
models. 


2.3.1 Basics 


Densities from the exponential family are defined by

$$p(x; \theta) := p_0(x) \exp\left(\langle \phi(x), \theta \rangle - g(\theta)\right). \tag{2.40}$$

Here $p_0(x)$ is a density on $\mathcal{X}$ and is often called the base measure, and
$\phi(x)$ is a map from $x$ to the sufficient statistics $\phi(x)$. $\theta$ is commonly
referred to as the natural parameter. It lives in the space dual to $\phi(x)$.
Moreover, $g(\theta)$ is a normalization constant which ensures that $p(x)$ is
properly normalized. $g$ is often referred to as the log-partition function. The
name stems from physics where $Z = e^{g(\theta)}$ denotes the number of states of a
physical ensemble. $g$ can be computed as follows:

$$g(\theta) = \log \int_{\mathcal{X}} \exp\left(\langle \phi(x), \theta \rangle\right) dx. \tag{2.41}$$

Here the base measure $p_0$ is taken to be absorbed into $dx$.


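Example 2.5 (Binary Model) Assume that $\mathcal{X} = \{0, 1\}$ and that $\phi(x) = x$.
In this case we have $g(\theta) = \log\left[e^0 + e^\theta\right] = \log\left[1 + e^\theta\right]$. It follows that
$p(x = 0; \theta) = \frac{1}{1 + e^\theta}$ and $p(x = 1; \theta) = \frac{e^\theta}{1 + e^\theta}$. In other words, by choosing
different values of $\theta$ one can recover different Bernoulli distributions.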


One of the convenient properties of exponential families is that the log- 
partition function g can be used to generate moments of the distribution 
itself simply by taking derivatives. 


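Theorem 2.14 (Log partition function) The function $g(\theta)$ is convex.
Moreover, the distribution $p(x; \theta)$ satisfies

$$\nabla_\theta g(\theta) = \mathbf{E}_x[\phi(x)] \text{ and } \nabla_\theta^2 g(\theta) = \mathrm{Var}_x[\phi(x)]. \tag{2.42}$$

Proof Note that $\nabla_\theta^2 g(\theta) = \mathrm{Var}_x[\phi(x)]$ implies that $g$ is convex, since the
covariance matrix is positive semidefinite. To show (2.42) we expand

$$\nabla_\theta g(\theta) = \frac{\int_{\mathcal{X}} \phi(x) \exp \langle \phi(x), \theta \rangle \, dx}{\int_{\mathcal{X}} \exp \langle \phi(x), \theta \rangle \, dx} = \int \phi(x)\, p(x; \theta)\, dx = \mathbf{E}_x[\phi(x)]. \tag{2.43}$$

Next we take the second derivative to obtain

$$\nabla_\theta^2 g(\theta) = \int \phi(x) \left[\phi(x) - \nabla_\theta g(\theta)\right]^\top p(x; \theta)\, dx \tag{2.44}$$
$$= \mathbf{E}_x\left[\phi(x)\phi(x)^\top\right] - \mathbf{E}_x[\phi(x)]\, \mathbf{E}_x[\phi(x)]^\top \tag{2.45}$$

which proves the claim. For the first equality we used (2.43). For the second
line we used the definition of the variance. □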


One may show that higher derivatives $\nabla_\theta^n g(\theta)$ generate higher order
cumulants of $\phi(x)$ under $p(x; \theta)$. This is why $g$ is often also referred to as the
cumulant-generating function. Note that in general, computation of $g(\theta)$
is nontrivial since it involves solving a high-dimensional integral. For many
cases, in fact, the computation is NP-hard, for instance when $\mathcal{X}$ is the
domain of permutations [FJ95]. Throughout the book we will discuss a number
of approximation techniques which can be applied in such a case.

Let us briefly illustrate (2.43) using the binary model of Example 2.5.
We have that $\nabla_\theta g = \frac{e^\theta}{1 + e^\theta}$ and $\nabla_\theta^2 g = \frac{e^\theta}{(1 + e^\theta)^2}$. This is exactly what we would
have obtained from direct computation of the mean $p(x = 1; \theta)$ and variance
$p(x = 1; \theta) - p(x = 1; \theta)^2$ subject to the distribution $p(x; \theta)$.
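
The moment-generating property can also be checked numerically for the binary model. This is a minimal sketch in which the derivatives of $g$ are approximated by central finite differences (a numerical stand-in, not part of the theory); the value $\theta = 0.7$ and the step size are arbitrary assumptions.

```python
import math


def g(theta):
    """Log-partition function of the binary model in Example 2.5."""
    return math.log(1.0 + math.exp(theta))


def p1(theta):
    """p(x = 1; theta) for the binary model."""
    return math.exp(theta) / (1.0 + math.exp(theta))


def derivative(f, theta, step=1e-4):
    """Central finite difference approximation of f'(theta)."""
    return (f(theta + step) - f(theta - step)) / (2 * step)


theta = 0.7  # arbitrary natural parameter
mean = p1(theta)
variance = p1(theta) - p1(theta) ** 2
print(derivative(g, theta), mean)
print(derivative(lambda t: derivative(g, t), theta), variance)
```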


2.3.2 Examples 


A large number of densities are members of the exponential family. Note, 
however, that in statistics it is not common to express them in the dot 
product formulation for historic reasons and for reasons of notational com- 
pactness. We discuss a number of common densities below and show why 
they can be written in terms of an exponential family. A detailed description
of the most commonly occurring types is given in a table.


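Gaussian Let $x, \mu \in \mathbb{R}^d$ and let $\Sigma \in \mathbb{R}^{d \times d}$ where $\Sigma \succ 0$, that is, $\Sigma$ is a
positive definite matrix. In this case the normal distribution can be
expressed via

$$p(x) = (2\pi)^{-\frac{d}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right) \tag{2.46}$$
$$= \exp\left(x^\top \left[\Sigma^{-1}\mu\right] + \mathrm{tr}\left(\left[-\tfrac{1}{2} x x^\top\right] \left[\Sigma^{-1}\right]\right) - c(\mu, \Sigma)\right)$$

where $c(\mu, \Sigma) = \tfrac{1}{2} \mu^\top \Sigma^{-1} \mu + \tfrac{d}{2} \log 2\pi + \tfrac{1}{2} \log |\Sigma|$. By combining the
terms in $x$ into $\phi(x) := (x, -\tfrac{1}{2} x x^\top)$ we obtain the sufficient statistics
of $x$. The corresponding linear coefficients $(\Sigma^{-1}\mu, \Sigma^{-1})$ constitute the
natural parameter $\theta$. All that remains to be done to express $p(x)$ in
terms of (2.40) is to rewrite $g(\theta)$ in terms of $c(\mu, \Sigma)$. The summary
table on the following page contains details.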

Multinomial Another popular distribution is one over $k$ discrete events.
In this case $\mathcal{X} = \{1, \ldots, k\}$ and we have in completely generic terms
$p(x) = \pi_x$ where $\pi_x \ge 0$ and $\sum_x \pi_x = 1$. Now denote by $e_x \in \mathbb{R}^k$ the
$x$-th unit vector of the canonical basis, that is $\langle e_x, e_{x'} \rangle = 1$ if $x = x'$
and 0 otherwise. In this case we may rewrite $p(x)$ via

$$p(x) = \pi_x = \exp\left(\langle e_x, \log \pi \rangle\right) \tag{2.47}$$

where $\log \pi = (\log \pi_1, \ldots, \log \pi_k)$. In other words, we have succeeded
in rewriting the distribution as a member of the exponential family
where $\phi(x) = e_x$ and where $\theta = \log \pi$. Note that in this definition $\theta$
is restricted to a $k - 1$ dimensional manifold (the $k$ dimensional
probability simplex). If we relax those constraints we need to ensure that
$p(x)$ remains normalized. Details are given in the summary table.


Poisson This distribution is often used to model distributions over discrete
events. For instance, the number of raindrops which fall on a given
surface area in a given amount of time, the number of stars in a
given volume of space, or the number of Prussian soldiers killed by
horse-kicks in the Prussian cavalry all follow this distribution. It is
given by

$$p(x) = \frac{e^{-\lambda} \lambda^x}{x!} = \frac{1}{x!} \exp\left(x \log \lambda - \lambda\right) \text{ where } x \in \mathbb{N}_0. \tag{2.48}$$

By defining $\phi(x) = x$ we obtain an exponential families model. Note
that things are a bit less trivial here since $\frac{1}{x!}$ is the nonuniform
counting measure on $\mathbb{N}_0$. The case of the uniform measure which
leads to the exponential distribution is discussed in Problem 2.16.

The reason why many discrete processes follow the Poisson distribution
is that it can be seen as the limit of the sum of a large
number of Bernoulli draws: denote by $z \in \{0, 1\}$ a random variable
with $p(z = 1) = \alpha$. Moreover, denote by $z_n$ the sum over $n$ draws
from this random variable. In this case $z_n$ follows the binomial
distribution with $p(z_n = k) = \binom{n}{k} \alpha^k (1 - \alpha)^{n-k}$. Now assume that
we let $n \to \infty$ such that the expected value of $z_n$ remains constant.
That is, we rescale $\alpha = \frac{\lambda}{n}$. In this case we have

$$p(z_n = k) = \frac{n!}{(n-k)!\,k!} \frac{\lambda^k}{n^k} \left(1 - \frac{\lambda}{n}\right)^{n-k} \tag{2.49}$$
$$= \frac{\lambda^k}{k!} \left(1 - \frac{\lambda}{n}\right)^{n} \left[\frac{n!}{n^k (n-k)!} \left(1 - \frac{\lambda}{n}\right)^{-k}\right]$$

For $n \to \infty$ the second term converges to $e^{-\lambda}$. The third term
converges to 1, since we have a product of only $2k$ terms, each of which
converges to 1. Using the exponential families notation we may check
that $\mathbf{E}[x] = \lambda$ and that moreover also $\mathrm{Var}[x] = \lambda$.
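
The convergence in (2.49) can be observed numerically. The sketch below compares the binomial probability with $\alpha = \lambda/n$ against the Poisson probability for growing $n$; the rate $\lambda = 3$ and the count $k = 2$ are arbitrary choices, and the function names are hypothetical.

```python
import math


def binomial_pmf(k, n, alpha):
    """Probability of k successes in n Bernoulli(alpha) draws."""
    return math.comb(n, k) * alpha ** k * (1.0 - alpha) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson probability (2.48)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam, k = 3.0, 2  # arbitrary rate and count
for n in (10, 100, 1000, 10_000):
    print(n, binomial_pmf(k, n, lam / n), poisson_pmf(k, lam))
```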


Beta This is a distribution on the unit interval $\mathcal{X} = [0, 1]$ which is very
versatile when it comes to modelling unimodal and bimodal
distributions. It is given by

$$p(x) = x^{a-1} (1 - x)^{b-1} \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}. \tag{2.50}$$

Taking logarithms we see that this, too, is an exponential families
distribution, since $p(x) = \exp((a - 1) \log x + (b - 1) \log(1 - x) +
\log \Gamma(a + b) - \log \Gamma(a) - \log \Gamma(b))$.

Fig. 2.10. Left: Poisson distributions with $\lambda \in \{1, 3, 10\}$. Right: Beta distributions
with $a = 2$ and $b \in \{1, 2, 3, 5, 7\}$. Note how with increasing $b$ the distribution
becomes more peaked close to the origin.


Figure 2.10 has a graphical description of the Poisson distribution and the
Beta distribution. For a more comprehensive list of exponential family
distributions see the table below and [Fel71, FT94, MN83]. In principle any
map $\phi(x)$, domain $\mathcal{X}$ with underlying measure $\mu$ are suitable, as long as the
log-partition function $g(\theta)$ can be computed efficiently.


Theorem 2.15 (Convex feasible domain) The domain of definition $\Theta$
of $g(\theta)$ is convex.


Proof By construction $g$ is convex and differentiable everywhere. Hence the
below-sets for all values $c$ with $\{x \mid g(x) \le c\}$ exist. Consequently the domain
of definition is convex. □


Having a convex function is very valuable when it comes to parameter infer- 
ence since convex minimization problems have unique minimum values and 
global minima. We will discuss this notion in more detail when designing 
maximum likelihood estimators. 

2.4 Estimation


In many statistical problems the challenge is to estimate parameters of in- 
terest. For instance, in the context of exponential families, we may want 
to estimate a parameter $\theta$ such that it is close to the “true” parameter $\theta^*$
in the distribution. While the problem is fully general, we will describe the 
relevant steps in obtaining estimates for the special case of the exponential 
family. This is done for two reasons — firstly, exponential families are an 
important special case and we will encounter slightly more complex variants 
on the reasoning in later chapters of the book. Secondly, they are of a suffi- 
ciently simple form that we are able to show a range of different techniques. 
In more advanced applications only a small subset of those methods may be 
practically feasible. Hence exponential families provide us with a working 
example based on which we can compare the consequences of a number of 
different techniques. 


2.4.1 Maximum Likelihood Estimation 


Whenever we have a distribution $p(x; \theta)$ parametrized by some parameter
$\theta$ we may use data to find a value of $\theta$ which maximizes the likelihood that
the data would have been generated by a distribution with this choice of 
parameter. 

For instance, assume that we observe a set of temperature measurements 
$X = \{x_1, \ldots, x_m\}$. In this case, we could try finding a normal
distribution such that the likelihood $p(X;\theta)$ of the data under the
assumption of a normal
distribution is maximized. Note that this does not imply in any way that the 
temperature measurements are actually drawn from a normal distribution. 
Instead, it means that we are attempting to find the Gaussian which fits the 
data in the best fashion. 

While this distinction may appear subtle, it is critical: we do not assume 
that our model accurately reflects reality. Instead, we simply try doing the 
best possible job at modeling the data given a specified model class. Later 
we will encounter alternative approaches to estimation, namely Bayesian
methods, which make the assumption that our model ought to be able to 
describe the data accurately. 


Definition 2.16 (Maximum Likelihood Estimator) For a model
$p(\cdot;\theta)$ parametrized by $\theta$ and observations $X$ the maximum
likelihood estimator (MLE) is

$$\hat{\theta}_{\mathrm{ML}}[X] := \operatorname*{argmax}_{\theta} p(X;\theta). \tag{2.51}$$

In the context of exponential families this leads to the following procedure:
given $m$ observations drawn iid from some distribution, we can express the
joint likelihood as

$$\begin{aligned}
p(X;\theta) &= \prod_{i=1}^m p(x_i;\theta) = \prod_{i=1}^m \exp\left(\langle \phi(x_i), \theta \rangle - g(\theta)\right) && (2.52) \\
&= \exp\left(m \left(\langle \mu[X], \theta \rangle - g(\theta)\right)\right) && (2.53) \\
\text{where } \mu[X] &:= \frac{1}{m} \sum_{i=1}^m \phi(x_i). && (2.54)
\end{aligned}$$


Here $\mu[X]$ is the empirical average of the map $\phi(x)$. Maximization of
$p(X;\theta)$ is equivalent to minimizing the negative log-likelihood
$-\log p(X;\theta)$. The latter is a common practical choice since for
independently drawn data the logarithm of the product of probabilities
decomposes into the sum of the logarithms of the individual likelihoods. This
leads to the following objective function to be minimized:

$$-\log p(X;\theta) = m \left[g(\theta) - \langle \theta, \mu[X] \rangle\right] \tag{2.55}$$

Since $g(\theta)$ is convex and $\langle \theta, \mu[X] \rangle$ is linear in
$\theta$, it follows that minimization of (2.55) is a convex optimization
problem. By Theorem 2.14, the first order optimality condition
$\nabla_\theta g(\theta) = \mu[X]$ for (2.55) implies that

$$\theta = \left[\nabla_\theta g\right]^{-1}(\mu[X]) \quad \text{or equivalently} \quad \mathbf{E}_{x \sim p(x;\theta)}[\phi(x)] = \nabla_\theta g(\theta) = \mu[X]. \tag{2.56}$$

Put another way, the above conditions state that we aim to find the
distribution $p(x;\theta)$ which has the same expected value of $\phi(x)$ as
what we observed empirically via $\mu[X]$. Under very mild technical
conditions a solution to (2.56) exists.

In general, (2.56) cannot be solved analytically. In certain special cases, 
though, this is easily possible. We discuss two such choices in the following: 
Multinomial and Poisson distributions. 


Example 2.6 (Poisson Distribution) For the Poisson distribution$^1$ where
$p(x;\theta) = \frac{1}{x!} \exp(\theta x - e^\theta)$ it follows that
$g(\theta) = e^\theta$ and $\phi(x) = x$. This allows us to solve (2.56) in
closed form using

$$\nabla_\theta g(\theta) = e^\theta = \frac{1}{m} \sum_{i=1}^m x_i \quad \text{and hence} \quad \theta = \log \sum_{i=1}^m x_i - \log m. \tag{2.57}$$

1 Often the Poisson distribution is specified using $\lambda := e^\theta$,
that is $\theta = \log \lambda$, as its rate parameter. In this case we have
$p(x;\lambda) = \lambda^x e^{-\lambda} / x!$ as its parametrization. The
advantage of the natural parametrization using $\theta$ is that we can
directly take advantage of the properties of the log-partition function as
generating the cumulants of $x$.
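To make (2.57) concrete, the following minimal sketch computes the natural
parameter estimate for a handful of made-up counts. The function name
`poisson_mle_natural` and the data are illustrative assumptions and not part
of the text.

```python
import math


def poisson_mle_natural(xs):
    """Return the MLE of the natural parameter theta for Poisson counts xs.

    Implements theta = log(sum_i x_i) - log(m) from (2.57).
    """
    m = len(xs)
    total = sum(xs)
    if m == 0 or total == 0:
        # log 0 is undefined: the estimate would lie on the boundary of the domain.
        raise ValueError("need at least one positive count")
    return math.log(total) - math.log(m)


counts = [2, 0, 3, 1, 4]                  # hypothetical observations
theta_hat = poisson_mle_natural(counts)   # log(10) - log(5) = log(2)
rate_hat = math.exp(theta_hat)            # equals the sample mean of the counts
```

Exponentiating the estimate recovers the sample mean, which is the familiar
estimate of the Poisson rate.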


Example 2.7 (Multinomial Distribution) For the multinomial distribution
the log-partition function is given by
$g(\theta) = \log \sum_{i=1}^k e^{\theta_i}$, hence we have that

$$\nabla_i g(\theta) = \frac{e^{\theta_i}}{\sum_{j=1}^k e^{\theta_j}} = \frac{1}{m} \sum_{j=1}^m \{x_j = i\}. \tag{2.58}$$

It is easy to check that (2.58) is satisfied for
$e^{\theta_i} = \sum_{j=1}^m \{x_j = i\}$. In other words, the MLE for a
discrete distribution is simply given by the empirical frequencies of
occurrence.


The multinomial setting also exhibits two rather important aspects of
exponential families: firstly, choosing
$\theta_i = c + \log \sum_{j=1}^m \{x_j = i\}$ for any $c \in \mathbb{R}$
will lead to an equivalent distribution. This is the case since the
sufficient statistic $\phi(x)$ is not minimal. In our context this means that
the coordinates of $\phi(x)$ are linearly dependent: for any $x$ we have that
$\sum_{i=1}^k [\phi(x)]_i = 1$, hence we could eliminate one dimension. This
is precisely the additional degree of freedom which is reflected in the
scaling freedom in $\theta$.

Secondly, for data where some events do not occur at all, the expression
$\log \left[\sum_{j=1}^m \{x_j = i\}\right] = \log 0$ is ill defined. This is
due to the fact that this particular set of counts occurs on the boundary of
the convex set within which the natural parameters $\theta$ are well defined.
We will see how different types of priors can alleviate the issue.
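Both aspects can be checked numerically. The sketch below is an illustration
with made-up data; the helper names `multinomial_mle` and `softmax` are
assumptions of this example. It computes the empirical frequencies, converts
them to natural parameters, and shifts all coordinates by a constant.

```python
import math
from collections import Counter


def multinomial_mle(xs, k):
    """Empirical frequencies of the outcomes 1..k, i.e. the MLE of the probabilities."""
    counts = Counter(xs)
    m = len(xs)
    return [counts.get(i, 0) / m for i in range(1, k + 1)]


def softmax(theta):
    """Map natural parameters to probabilities exp(theta_i - g(theta))."""
    top = max(theta)                         # subtract the max for numerical stability
    weights = [math.exp(t - top) for t in theta]
    z = sum(weights)
    return [w / z for w in weights]


xs = [1, 3, 3, 2, 3, 1]                      # hypothetical draws from {1, 2, 3}
probs = multinomial_mle(xs, k=3)
theta = [math.log(p) for p in probs]         # would raise on log(0) for an unseen outcome
shifted = [t + 5.0 for t in theta]           # adding any constant c leaves p unchanged
```

Calling `softmax(theta)` and `softmax(shifted)` returns the same probabilities,
whereas an outcome with zero count makes `math.log` fail, which is the
boundary case described above.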

Using the MLE is not without problems. As we saw in Figure 2.1, conver- 
gence can be slow, since we are not using any side information. The latter 
can provide us with problems which are both numerically better conditioned 
and which show better convergence, provided that our assumptions are ac- 
curate. Before discussing a Bayesian approach to estimation, let us discuss 
basic statistical properties of the estimator. 


2.4.2 Bias, Variance and Consistency 


When designing any estimator $\hat{\theta}(X)$ we would like to obtain a
number of desirable properties: in general it should not be biased towards a
particular solution unless we have good reason to believe that this solution
should be preferred. Instead, we would like the estimator to recover, at
least on average, the "correct" parameter, should it exist. This can be
formalized in the notion of an unbiased estimator.

Secondly, even if no correct parameter can be found, e.g. when we are trying
to fit a Gaussian distribution to data which is not normally distributed, we
would like to converge to the best possible parameter choice as we obtain
more data. This is what is understood by consistency.

Finally, we would like the estimator to achieve low bias and near-optimal 
estimates as quickly as possible. The latter is measured by the efficiency 
of an estimator. In this context we will encounter the Cramér-Rao bound 
which controls the best possible rate at which an estimator can achieve this 
goal. Figure 2.11 gives a pictorial description. 


Fig. 2.11. Left: unbiased estimator; the estimates, denoted by circles have as mean 
the true parameter, as denoted by a star. Middle: consistent estimator. While the 
true model is not within the class we consider (as denoted by the ellipsoid), the 
estimates converge to the white star which is the best model within the class that 
approximates the true model, denoted by the solid star. Right: different estimators 
have different regions of uncertainty, as made explicit by the ellipses around the 
true parameter (solid star). 


Definition 2.17 (Unbiased Estimator) An estimator $\hat{\theta}[X]$ is
unbiased if for all $\theta$ where $X \sim p(X;\theta)$ we have
$\mathbf{E}_X[\hat{\theta}[X]] = \theta$.


In other words, in expectation the parameter estimate matches the true pa- 
rameter. Note that this only makes sense if a true parameter actually exists. 
For instance, if the data is Poisson distributed and we attempt modeling it 
by a Gaussian we will obviously not obtain unbiased estimates. 

For finite sample sizes MLE is often biased. For instance, for the normal
distribution the variance estimates carry bias $O(m^{-1})$. See problem 2.19
for details. In general, under fairly mild conditions, MLE is asymptotically
unbiased [DGL96]. We prove this for exponential families. For more general
settings the proof depends on the dimensionality and smoothness of the
family of densities that we have at our disposal.
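The bias of the variance estimate can be observed in a small simulation. The
sketch below is an illustration rather than a proof: it averages the maximum
likelihood variance estimate over many samples of size $m$ drawn from a
standard normal distribution and compares the result with the standard value
$\frac{m-1}{m} \sigma^2$. The function names, the sample size and the seed are
arbitrary choices.

```python
import random


def mle_variance(xs):
    """Maximum likelihood variance estimate: divide by m rather than m - 1."""
    m = len(xs)
    mean = sum(xs) / m
    return sum((x - mean) ** 2 for x in xs) / m


def average_mle_variance(m, trials, seed=0):
    """Average the MLE variance over many samples of size m drawn from N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        total += mle_variance(xs)
    return total / trials


m = 5
estimate = average_mle_variance(m, trials=20000)
expected = (m - 1) / m     # E[variance estimate] = (m - 1) / m for sigma^2 = 1
```

For $m = 5$ the average lies near 0.8 rather than 1, and the gap shrinks at
rate $1/m$ as the sample size grows.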
Theorem 2.18 (MLE for Exponential Families) Assume that $X$ is an
$m$-sample drawn iid from $p(x;\theta)$. The estimate
$\hat{\theta}[X] = [\nabla_\theta g]^{-1}(\mu[X])$ is asymptotically normal
with

$$m^{\frac{1}{2}} \left[\hat{\theta}[X] - \theta\right] \to \mathcal{N}\left(0, \left[\nabla_\theta^2 g(\theta)\right]^{-1}\right). \tag{2.59}$$

In other words, the estimate $\hat{\theta}[X]$ is asymptotically normal, it
converges to the true parameter $\theta$, and moreover, the variance at the
correct parameter is given by the inverse of the covariance matrix of the
data, as given by the second derivative of the log-partition function
$\nabla_\theta^2 g(\theta)$.

Proof Denote by $\mu = \nabla_\theta g(\theta)$ the true mean. Moreover, note
that $\nabla_\theta^2 g(\theta)$ is the covariance of $\phi(x)$ for data
drawn from $p(x;\theta)$. By the central limit theorem (Theorem 2.3) we have
that $m^{\frac{1}{2}} [\mu[X] - \mu] \to \mathcal{N}(0, \nabla_\theta^2 g(\theta))$.

Now note that $\hat{\theta}[X] = [\nabla_\theta g]^{-1}(\mu[X])$. Therefore,
by the delta method (Theorem 2.5) we know that $\hat{\theta}[X]$ is also
asymptotically normal. Moreover, by the inverse function theorem the Jacobian
of $[\nabla_\theta g]^{-1}$ satisfies
$\nabla_\mu [\nabla_\theta g]^{-1}(\mu) = [\nabla_\theta^2 g(\theta)]^{-1}$.
Applying Slutsky's theorem (Theorem 2.4) proves the claim. $\blacksquare$


Now that we established the asymptotic properties of the MLE for exponential
families it is only natural to ask how much variation one may expect in
$\hat{\theta}[X]$ when performing estimation. The Cramér-Rao bound governs
this.


Theorem 2.19 (Cramér and Rao [Rao73]) Assume that $X$ is drawn from
$p(X;\theta)$ and let $\hat{\theta}[X]$ be an asymptotically unbiased
estimator. Denote by $I$ the Fisher information matrix and by $B$ the
variance of $\hat{\theta}[X]$ where

$$I := \mathrm{Cov}\left[\nabla_\theta \log p(X;\theta)\right] \quad \text{and} \quad B := \mathrm{Var}\left[\hat{\theta}[X]\right]. \tag{2.60}$$

In this case $\det IB \geq 1$ for all estimators $\hat{\theta}[X]$.


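Proof We prove the claim for the scalar case. The extension to matrices is
straightforward. Using the Cauchy-Schwarz inequality we have

$$\mathrm{Cov}^2\left[\nabla_\theta \log p(X;\theta), \hat{\theta}[X]\right] \leq \mathrm{Var}\left[\nabla_\theta \log p(X;\theta)\right] \mathrm{Var}\left[\hat{\theta}[X]\right] = IB. \tag{2.61}$$

Note that at the true parameter the expected log-likelihood score vanishes

$$\mathbf{E}_X\left[\nabla_\theta \log p(X;\theta)\right] = \nabla_\theta \int p(X;\theta) \, dX = \nabla_\theta 1 = 0. \tag{2.62}$$

Hence we may simplify the covariance formula by dropping the means via

$$\begin{aligned}
\mathrm{Cov}\left[\nabla_\theta \log p(X;\theta), \hat{\theta}[X]\right] &= \mathbf{E}_X\left[\nabla_\theta \log p(X;\theta) \, \hat{\theta}[X]\right] \\
&= \int p(X;\theta) \, \hat{\theta}(X) \, \nabla_\theta \log p(X;\theta) \, dX \\
&= \nabla_\theta \int p(X;\theta) \, \hat{\theta}(X) \, dX = \nabla_\theta \theta = 1.
\end{aligned}$$

Here the last line follows since we may interchange integration over $X$ and
the derivative with respect to $\theta$, and since the expectation of
$\hat{\theta}$ equals $\theta$. Combining this with (2.61) yields
$IB \geq 1$. $\blacksquare$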


The Cramér-Rao theorem implies that there is a limit to how well we may 
estimate a parameter given finite amounts of data. It is also a yardstick by 
which we may measure how efficiently an estimator uses data. Formally, we 
define the efficiency as the quotient between actual performance and the 
Cramér-Rao bound via 


$$e := 1 / \det IB. \tag{2.63}$$

The closer $e$ is to 1, the lower the variance of the corresponding estimator
$\hat{\theta}(X)$. Theorem 2.18 implies that for exponential families MLE is
asymptotically efficient. This turns out to be generally true.
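As a worked illustration, which is an addition and not part of the original
derivation, consider $m$ iid Poisson observations in the mean parametrization
$p(x;\lambda) = \lambda^x e^{-\lambda}/x!$ with the sample mean as estimator
of $\lambda$. Using only the standard facts
$\mathbf{E}[x] = \mathrm{Var}[x] = \lambda$, the following lines check that
the bound is attained in this case:

```latex
\begin{aligned}
\nabla_\lambda \log p(X;\lambda)
  &= \sum_{i=1}^m \left(\frac{x_i}{\lambda} - 1\right)
  && \text{(score of the sample)} \\
I &= \mathrm{Var}\left[\sum_{i=1}^m \frac{x_i}{\lambda}\right]
   = \frac{m \lambda}{\lambda^2} = \frac{m}{\lambda}
  && \text{(Fisher information)} \\
B &= \mathrm{Var}\left[\frac{1}{m} \sum_{i=1}^m x_i\right] = \frac{\lambda}{m}
  && \text{(variance of the sample mean)} \\
I B &= \frac{m}{\lambda} \cdot \frac{\lambda}{m} = 1
  \quad \text{hence} \quad e = \frac{1}{IB} = 1 .
\end{aligned}
```

The sample mean is unbiased for $\lambda$, so the theorem applies, and it is
efficient already for finite $m$.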


Theorem 2.20 (Efficiency of MLE [Cra46, GW92, Ber85]) The maximum
likelihood estimator is asymptotically efficient ($e = 1$).


So far we only discussed the behavior of $\hat{\theta}[X]$ whenever there
exists a true $\theta$ generating $p(X;\theta)$. If this is not true, we need
to settle for less: how well $\hat{\theta}[X]$ approaches the best possible
choice within the given model class. Such behavior is referred to as
consistency. Note that it is not possible to define consistency per se. For
instance, we may ask whether $\hat{\theta}$ converges to the optimal
parameter $\theta^*$, or whether $p(x;\hat{\theta})$ converges to the optimal
density $p(x;\theta^*)$, and with respect to which norm. Under fairly general
conditions such convergence holds for finite-dimensional parameters and
smoothly parametrized densities. See [DGL96, vdG00] for proofs and further
details.


2.4.3 A Bayesian Approach


The analysis of the Maximum Likelihood method might suggest that in- 
ference is a solved problem. After all, in the limit, MLE is unbiased and it 
exhibits as small variance as possible. Empirical results using a finite amount
of data, as presented in Figure 2.1, indicate otherwise.

While not making any assumptions can lead to interesting and general
theorems, it ignores the fact that in practice we almost always have some
idea about what to expect of our solution. It would be foolish to ignore such
additional information. For instance, when trying to determine the voltage
of a battery, it is reasonable to expect a measurement on the order of 1.5V
or less. Consequently such prior knowledge should be incorporated into the
estimation process. In fact, the use of side information to guide estimation
turns out to be the tool for building estimators which work well in high
dimensions.

Recall Bayes' rule (1.15) which states that
$p(\theta|x) = \frac{p(x|\theta) p(\theta)}{p(x)}$. In our context this means
that if we are interested in the posterior probability of $\theta$ assuming a
particular value, we may obtain this using the likelihood (often referred to
as evidence) of $x$ having been generated by $\theta$ via $p(x|\theta)$ and
our prior belief $p(\theta)$ that $\theta$ might be chosen in the
distribution generating $x$. Observe the subtle but important difference to
MLE: instead of treating $\theta$ as a parameter of a density model, we treat
$\theta$ as an unobserved random variable which we may attempt to infer given
the observations $X$.

This can be done for a number of different purposes: we might want to
infer the most likely value of the parameter given the posterior distribution
$p(\theta|X)$. This is achieved by

$$\hat{\theta}_{\mathrm{MAP}}(X) := \operatorname*{argmax}_{\theta} p(\theta|X) = \operatorname*{argmin}_{\theta} -\log p(X|\theta) - \log p(\theta). \tag{2.64}$$

The second equality follows since $p(X)$ does not depend on $\theta$. This
estimator is also referred to as the Maximum a Posteriori, or MAP estimator.
It differs from the maximum likelihood estimator by adding the negative
log-prior to the optimization problem. For this reason it is sometimes also
referred to as Penalized MLE. Effectively we are penalizing unlikely choices
$\theta$ via $-\log p(\theta)$.

Note that using $\hat{\theta}_{\mathrm{MAP}}(X)$ as the parameter of choice
is not quite accurate. After all, we can only infer a distribution over
$\theta$ and in general there is no guarantee that the posterior is indeed
concentrated around its mode. A more accurate treatment is to use the
distribution $p(\theta|X)$ directly via

$$p(x|X) = \int p(x|\theta) \, p(\theta|X) \, d\theta. \tag{2.65}$$

In other words, we integrate out the unknown parameter $\theta$ and obtain the
density estimate directly. As we will see, it is generally impossible to solve
(2.65) exactly, an important exception being conjugate priors. In the other 
cases one may resort to sampling from the posterior distribution to approx- 
imate the integral. 

While it is possible to design a wide variety of prior distributions, this book
focuses on two important families: norm-constrained priors and conjugate
priors. We will encounter them throughout, the former sometimes in the
guise of regularization and Gaussian Processes, the latter in the context of
exchangeable models such as the Dirichlet Process.

Norm-constrained priors take on the form

$$p(\theta) \propto \exp\left(-\lambda \|\theta - \theta_0\|_p^d\right) \quad \text{for } p, d \geq 1 \text{ and } \lambda > 0. \tag{2.66}$$

That is, they restrict the deviation of the parameter value $\theta$ from some
guess $\theta_0$. The intuition is that extreme values of $\theta$ are much
less likely than more moderate choices of $\theta$ which will lead to more
smooth and even distributions $p(x|\theta)$.

A popular choice is the Gaussian prior which we obtain for $p = d = 2$
and $\lambda = 1/(2\sigma^2)$. Typically one sets $\theta_0 = 0$ in this case.
Note that in (2.66) we did not spell out the normalization of $p(\theta)$: in
the context of MAP estimation this is not needed since it simply becomes a
constant offset in the optimization problem (2.64). We have

$$\hat{\theta}_{\mathrm{MAP}}[X] = \operatorname*{argmin}_{\theta} \; m \left[g(\theta) - \langle \theta, \mu[X] \rangle\right] + \lambda \|\theta - \theta_0\|_p^d \tag{2.67}$$

For $d, p \geq 1$ and $\lambda > 0$ the resulting optimization problem is
convex and it has a unique solution. Moreover, very efficient algorithms exist
to solve this problem. We will discuss this in detail in Chapter 3. Figure
2.12 shows the regions of equal prior probability for a range of different
norm-constrained priors.

As can be seen from the diagram, the choice of the norm can have profound
consequences on the solution. That said, as we will show in Chapter ??, the
estimate $\hat{\theta}_{\mathrm{MAP}}$ is well concentrated and converges to
the optimal solution under fairly general conditions.

An alternative to norm-constrained priors are conjugate priors. They are
designed such that the posterior $p(\theta|X)$ has the same functional form
as the prior $p(\theta)$. In exponential families such priors are defined via

$$\begin{aligned}
p(\theta|n,\nu) &= \exp\left(\langle n\nu, \theta \rangle - n g(\theta) - h(\nu, n)\right) \quad \text{where} && (2.68) \\
h(\nu, n) &= \log \int \exp\left(\langle n\nu, \theta \rangle - n g(\theta)\right) d\theta. && (2.69)
\end{aligned}$$

Note that $p(\theta|n,\nu)$ itself is a member of the exponential family with
the feature map $\phi(\theta) = (\theta, -g(\theta))$. Hence $h(\nu, n)$ is
convex in $(n\nu, n)$. Moreover, the posterior distribution has the form

$$p(\theta|X) \propto p(X|\theta) \, p(\theta|n,\nu) \propto \exp\left(\langle m\mu[X] + n\nu, \theta \rangle - (m+n) g(\theta)\right). \tag{2.70}$$


Fig. 2.12. From left to right: regions of equal prior probability in
$\mathbb{R}^2$ for priors using the $\ell_1$, $\ell_2$ and $\ell_\infty$ norm.
Note that only the $\ell_2$ norm is invariant with regard to the coordinate
system. As we shall see later, the $\ell_1$ norm prior leads to solutions
where only a small number of coordinates is nonzero.


That is, the posterior distribution has the same form as a conjugate prior
with parameters $\frac{m\mu[X] + n\nu}{m+n}$ and $m+n$. In other words, $n$
acts like a phantom sample size and $\nu$ is the corresponding mean parameter.
Such an interpretation is reasonable given our desire to design a prior
which, when combined with the likelihood, remains in the same model class: we
treat prior knowledge as having observed virtual data beforehand which is
then added to the actual set of observations. In this sense data and prior
become completely equivalent: we obtain our knowledge either from actual
observations or from virtual observations which describe our belief about how
the data generation process is supposed to behave.
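The phantom-sample interpretation amounts to a weighted average of the prior
mean parameter and the empirical mean. A minimal sketch of this bookkeeping
follows; the function name `conjugate_update` and the numbers are illustrative
assumptions.

```python
def conjugate_update(nu, n, mu_x, m):
    """Posterior parameters of a conjugate prior after m observations.

    nu and mu_x are lists holding the prior mean parameter and the empirical
    average mu[X]; n and m are the phantom and the actual sample sizes.
    """
    n_post = m + n
    nu_post = [(m * mu_i + n * nu_i) / n_post for mu_i, nu_i in zip(mu_x, nu)]
    return nu_post, n_post


# Hypothetical numbers: 2 phantom observations with mean 0.5 and
# 8 actual observations with mean 1.0.
nu_post, n_post = conjugate_update(nu=[0.5], n=2, mu_x=[1.0], m=8)
```

With more actual data the posterior mean parameter moves towards the
empirical average, while a large $n$ keeps it close to the prior guess.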

Eq. (2.70) has the added benefit of allowing us to provide an exact
normalized version of the posterior. Using (2.68) we obtain that

$$p(\theta|X) = \exp\left(\langle m\mu[X] + n\nu, \theta \rangle - (m+n) g(\theta) - h\left(\tfrac{m\mu[X] + n\nu}{m+n}, m+n\right)\right).$$


The main remaining challenge is to compute the normalization $h$ for a range
of important conjugate distributions. The table on the following page
provides details. Besides attractive algebraic properties, conjugate priors
also have a second advantage, namely that the integral (2.65) can be solved
exactly:

$$p(x|X) = \int \exp\left(\langle \phi(x), \theta \rangle - g(\theta)\right) \times \exp\left(\langle m\mu[X] + n\nu, \theta \rangle - (m+n) g(\theta) - h\left(\tfrac{m\mu[X] + n\nu}{m+n}, m+n\right)\right) d\theta$$

Combining terms one may check that the integrand amounts to the
normalization in the conjugate distribution, albeit with $\phi(x)$ added.
This yields

$$p(x|X) = \exp\left(h\left(\tfrac{m\mu[X] + n\nu + \phi(x)}{m+n+1}, m+n+1\right) - h\left(\tfrac{m\mu[X] + n\nu}{m+n}, m+n\right)\right)$$

Such an expansion is very useful whenever we would like to draw $x$ from
$p(x|X)$ without the need to obtain an instantiation of the latent variable
$\theta$. We provide explicit expansions in appendix 2. [GS04] use the fact
that $\theta$ can be integrated out to obtain what is called a collapsed Gibbs
sampler for topic models [BNJ03].


2.4.4 An Example 


Assume we would like to build a language model based on available doc- 
uments. For instance, a linguist might be interested in estimating the fre- 
quency of words in Shakespeare’s collected works, or one might want to 
compare the change with respect to a collection of webpages. While mod- 
els describing documents by treating them as bags of words which all have 
been obtained independently of each other are exceedingly simple, they are 
valuable for quick-and-dirty content filtering and categorization, e.g. a spam 
filter on a mail server or a content filter for webpages. 

Hence we model a document $d$ as a multinomial distribution: denote by
$w_i$ for $i \in \{1, \ldots, m_d\}$ the words in $d$. Moreover, denote by
$p(w|\theta)$ the probability of occurrence of word $w$, then under the
assumption that the words are independently drawn, we have

$$p(d|\theta) = \prod_{i=1}^{m_d} p(w_i|\theta). \tag{2.71}$$


It is our goal to find parameters $\theta$ such that $p(d|\theta)$ is
accurate. For a given collection $D$ of documents denote by $m_w$ the number
of counts for word $w$ in the entire collection. Moreover, denote by $m$ the
total number of words in the entire collection. In this case we have

$$p(D|\theta) = \prod_i p(d_i|\theta) = \prod_w p(w|\theta)^{m_w}. \tag{2.72}$$


Finding suitable parameters $\theta$ given $D$ proceeds as follows: In a
maximum likelihood model we set

$$p(w|\theta) = \frac{m_w}{m}. \tag{2.73}$$

In other words, we use the empirical frequency of occurrence as our best
guess and the sufficient statistic of $D$ is $\phi(w) = e_w$, where $e_w$
denotes the unit vector which is nonzero only for the "coordinate" $w$. Hence
$\mu[D]_w = \frac{m_w}{m}$.

We know that the conjugate prior of the multinomial model is a Dirichlet
model. It follows from (2.70) that the posterior mode is obtained by replacing
$\mu[D]$ by $\frac{m\mu[D] + n\nu}{m+n}$. Denote by $n_w := \nu_w \cdot n$ the
pseudo-counts arising from the conjugate prior with parameters $(\nu, n)$. In
this case we will estimate the probability of the word $w$ as

$$p(w|\theta) = \frac{m_w + n_w}{m + n} = \frac{m_w + n\nu_w}{m + n}. \tag{2.74}$$

In other words, we add the pseudo counts $n_w$ to the actual word counts
$m_w$. This is particularly useful when the document we are dealing with is
brief, that is, whenever we have little data: it is quite unreasonable to
infer from a webpage of approximately 1000 words that words not occurring in
this page have zero probability. This is exactly what is mitigated by means of
the conjugate prior $(\nu, n)$.
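A minimal sketch of the smoothed estimate (2.74) follows. It assumes a
uniform prior mean over a fixed vocabulary and that every word of the document
is contained in that vocabulary; the function name, the toy vocabulary and the
choice of the phantom sample size are illustrative only.

```python
from collections import Counter


def smoothed_word_probabilities(words, vocabulary, n=1.0):
    """Estimate p(w) = (m_w + n * nu_w) / (m + n) for every word in the vocabulary."""
    counts = Counter(words)
    m = len(words)
    nu = 1.0 / len(vocabulary)           # assumption: uniform prior mean nu_w
    return {w: (counts.get(w, 0) + n * nu) / (m + n) for w in vocabulary}


vocab = ["to", "be", "or", "not", "question"]
doc = ["to", "be", "or", "not", "to", "be"]
probs = smoothed_word_probabilities(doc, vocab, n=5.0)
```

The word "question" never occurs in the toy document, yet it receives
probability $1/11$ instead of zero, while the estimates still sum to one.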

Finally, let us consider norm-constrained priors of the form (2.66). In this
case, the integral required for

$$\begin{aligned}
p(D) &= \int p(D|\theta) \, p(\theta) \, d\theta \\
&\propto \int \exp\left(-\lambda \|\theta - \theta_0\|_p^d + m \langle \mu[D], \theta \rangle - m g(\theta)\right) d\theta
\end{aligned}$$

is intractable and we need to resort to an approximation. A popular choice
is to replace the integral by $p(D|\theta^*)$ where $\theta^*$ maximizes the
integrand. This is precisely the MAP approximation of (2.64). Hence, in order
to perform estimation we need to solve

$$\operatorname*{minimize}_{\theta} \; g(\theta) - \langle \mu[D], \theta \rangle + \frac{\lambda}{m} \|\theta - \theta_0\|_p^d. \tag{2.75}$$


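A very simple strategy for minimizing (2.75) is gradient descent. That is,
for a given value of $\theta$ we compute the gradient of the objective
function and take a fixed step in the direction of the negative gradient. For
simplicity assume that $d = p = 2$ and $\lambda = 1/(2\sigma^2)$, that is, we
assume that $\theta$ is normally distributed with variance $\sigma^2$ and mean
$\theta_0$. The gradient of the objective in (2.75), which equals
$-\log p(D,\theta)$ up to scaling by $1/m$ and an additive constant, is given
by

$$\frac{1}{m} \nabla_\theta \left[-\log p(D,\theta)\right] = \mathbf{E}_{x \sim p(x|\theta)}[\phi(x)] - \mu[D] + \frac{1}{m\sigma^2} [\theta - \theta_0] \tag{2.76}$$

In other words, it depends on the discrepancy between the mean of $\phi(x)$
with respect to our current model and the empirical average $\mu[D]$, and the
difference between $\theta$ and the prior mean $\theta_0$.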

Unfortunately, convergence of the procedure
$\theta \leftarrow \theta - \eta \nabla_\theta[\ldots]$ is usually very slow,
even if we adjust the steplength $\eta$ efficiently. The reason is that the
gradient need not point towards the minimum as the space is most likely
distorted. A better strategy is to use Newton's method (see Chapter 3 for
a detailed discussion and a convergence proof). It relies on a second order
Taylor approximation

$$-\log p(D, \theta + \delta) \approx -\log p(D, \theta) + \langle \delta, G \rangle + \frac{1}{2} \delta^\top H \delta \tag{2.77}$$

where $G$ and $H$ are the first and second derivatives of $-\log p(D,\theta)$
with respect to $\theta$. The quadratic expression can be minimized with
respect to $\delta$ by choosing $\delta = -H^{-1} G$ and we can fashion an
update algorithm from this by letting $\theta \leftarrow \theta - H^{-1} G$.
One may show (see Chapter 3) that Algorithm 2.1 is quadratically convergent.
Note that the prior on $\theta$ ensures that $H$ is well conditioned even in
the case where the variance of $\phi(x)$ is not. In practice this means that
the prior ensures fast convergence of the optimization algorithm.


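Algorithm 2.1 Newton method for MAP estimation
NewtonMAP($D$)
  Initialize $\theta = \theta_0$
  while not converged do
    Compute $G = \mathbf{E}_{x \sim p(x|\theta)}[\phi(x)] - \mu[D] + \frac{1}{m\sigma^2} [\theta - \theta_0]$
    Compute $H = \mathrm{Var}_{x \sim p(x|\theta)}[\phi(x)] + \frac{1}{m\sigma^2} \mathbf{1}$
    Update $\theta \leftarrow \theta - H^{-1} G$
  end while
  return $\theta$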
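As an illustration of Algorithm 2.1, the sketch below specializes it to the
scalar Poisson model of Example 2.6, where both the mean and the variance of
$\phi(x) = x$ equal $e^\theta$. The function name, the stopping rule and the
default prior parameters are assumptions of this example and not prescribed by
the algorithm.

```python
import math


def newton_map_poisson(mu, m, sigma2=1.0, theta0=0.0, tol=1e-12, max_iter=100):
    """Newton iterations for the MAP estimate of a Poisson natural parameter.

    mu is the empirical mean mu[D], m the number of observations, and
    sigma2, theta0 are the variance and mean of the Gaussian prior on theta.
    """
    theta = theta0
    for _ in range(max_iter):
        mean = math.exp(theta)        # E[phi(x)] = e^theta for the Poisson
        var = math.exp(theta)         # Var[phi(x)] = e^theta as well
        grad = mean - mu + (theta - theta0) / (m * sigma2)
        hess = var + 1.0 / (m * sigma2)
        step = grad / hess
        theta -= step                 # theta <- theta - H^{-1} G
        if abs(step) < tol:
            break
    return theta


theta_map = newton_map_poisson(mu=2.0, m=5)
```

With a prior centered at zero the result is slightly smaller than the maximum
likelihood solution $\log 2$, and the term $1/(m\sigma^2)$ keeps the second
derivative bounded away from zero.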


2.5 Sampling 


So far we considered the problem of estimating the underlying probability 
density, given a set of samples drawn from that density. Now let us turn to 
the converse problem, that is, how to generate random variables given the 
underlying probability density. In other words, we want to design a random 
variable generator. This is useful for a number of reasons: 

We may encounter probability distributions where optimization over suit- 
able model parameters is essentially impossible and where it is equally im- 
possible to obtain a closed form expression of the distribution. In these cases 
it may still be possible to perform sampling to draw examples of the kind 
of data we expect to see from the model. Chapter ?? discusses a number of 
graphical models where this problem arises. 

Secondly, assume that we are interested in testing the performance of a
network router under different load conditions. Instead of introducing the
under-development router in a live network and wreaking havoc, one could
estimate the probability density of the network traffic under various load
conditions and build a model. The behavior of the network can then be
simulated by using a probabilistic model. This involves drawing random
variables from an estimated probability distribution.

Carrying on, suppose that we generate data packets by sampling and see
an anomalous behavior in our router. In order to reproduce and debug
this problem one needs access to the same set of random packets which
caused the problem in the first place. In other words, it is often convenient
if our random variable generator is reproducible. At first blush this seems
like a contradiction. After all, our random number generator is supposed
to generate random variables. This is less of a contradiction if we consider
how random numbers are generated in a computer: given a particular
initialization (which typically depends on the state of the system, e.g. time,
disk size, bios checksum, etc.) the random number algorithm produces a
sequence of numbers which, for all practical purposes, can be treated as iid.
A simple method is the linear congruential generator [PTVF94]

$$x_{i+1} = (a x_i + b) \bmod c.$$


The performance of these iterations depends significantly on the choice of the
constants $a, b, c$. For instance, the GNU C compiler uses $a = 1103515245$,
$b = 12345$ and $c = 2^{32}$. In general $b$ and $c$ need to be relatively
prime and $a - 1$ needs to be divisible by all prime factors of $c$, and by 4
if $c$ is divisible by 4. It is very much advisable not to attempt
implementing such generators on one's own unless it is absolutely necessary.
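For illustration only, and in line with the warning above not meant for
serious use, a linear congruential generator can be sketched in a few lines.
The multiplier and increment are the values quoted in the text; the modulus
$2^{32}$ and the rescaling of the state to $[0, 1)$ are assumptions of this
sketch, and real implementations differ in which bits of the state they
return.

```python
def lcg(seed, a=1103515245, b=12345, c=2**32):
    """Yield the sequence x_{i+1} = (a * x_i + b) mod c starting from seed."""
    x = seed
    while True:
        x = (a * x + b) % c
        yield x


def lcg_uniform(seed, count, c=2**32):
    """Return `count` pseudo random numbers in [0, 1) by rescaling the LCG state."""
    gen = lcg(seed, c=c)
    return [next(gen) / c for _ in range(count)]


first = lcg_uniform(seed=1, count=3)
again = lcg_uniform(seed=1, count=3)    # same seed, hence the same sequence
```

Running the generator twice with the same seed reproduces the same numbers,
which is the reproducibility property discussed above.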

Useful desiderata for a pseudo random number generator (PRNG) are that 
for practical purposes it is statistically indistinguishable from a sequence of 
iid data. That is, when applying a number of statistical tests, we will accept 
the null-hypothesis that the random variables are iid. See Chapter ?? for 
a detailed discussion of statistical testing procedures for random variables. 
In the following we assume that we have access to a uniform RNG U[0, 1] 
which draws random numbers uniformly from the range [0, 1]. 


2.5.1 Inverse Transformation 

We now consider the scenario where we would like to draw from some dis- 
tinctively non-uniform distribution. Whenever the latter is relatively simple 
this can be achieved by applying an inverse transform: 


Theorem 2.21 For $z \sim p(z)$ with $z \in \mathcal{Z}$ and an injective
transformation $\phi: \mathcal{Z} \to \mathcal{X}$ with inverse transform
$\phi^{-1}$ on $\phi(\mathcal{Z})$ it follows that the random variable
$x := \phi(z)$ is drawn from
$\left|\nabla_x \phi^{-1}(x)\right| \cdot p(\phi^{-1}(x))$. Here
$\left|\nabla_x \phi^{-1}(x)\right|$ denotes the determinant of the Jacobian
of $\phi^{-1}$.


[Figure 2.13 panels: Discrete Probability Distribution; Cumulative Density Function]

Fig. 2.13. Left: discrete probability distribution over 5 possible outcomes.
Right: associated cumulative distribution function. When sampling, we draw
$x$ uniformly at random from $U[0,1]$ and compute the inverse of $F$.


This follows immediately by applying a variable transformation for a
measure, i.e. we change $dp(z)$ to
$dp(\phi^{-1}(x)) \left|\nabla_x \phi^{-1}(x)\right|$. Such a conversion
strategy is particularly useful for univariate distributions.


Corollary 2.22 Denote by $p(x)$ a distribution on $\mathbb{R}$ with cumulative
distribution function $F(x') = \int_{-\infty}^{x'} dp(x)$. Then the
transformation $x = \phi(z) = F^{-1}(z)$ converts samples $z \sim U[0,1]$ to
samples drawn from $p(x)$.


We now apply this strategy to a number of univariate distributions. One of 
the most common cases is sampling from a discrete distribution. 


Example 2.8 (Discrete Distribution) In the case of a discrete distribu- 
tion over {1,...,k} the cumulative distribution function is a step-function 
with steps at {1,...,k} where the height of each step is given by the corre- 
sponding probability of the event. 

The implementation works as follows: denote by $p \in [0,1]^k$ the vector of
probabilities and denote by $f \in [0,1]^k$ with $f_i = f_{i-1} + p_i$ and
$f_1 = p_1$ the steps of the cumulative distribution function. Then for a
random variable $z$ drawn from $U[0,1]$ we obtain
$x = \phi(z) := \operatorname*{argmin}_i \{f_i \geq z\}$. See Figure 2.13
for an example of a distribution over 5 events.
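The construction in Example 2.8 translates almost literally into code. The
sketch below accumulates the probabilities into the steps $f_i$ and uses a
binary search to find the smallest index with $f_i \geq z$. The probability
vector is a made-up example and the helper name `make_discrete_sampler` is an
assumption.

```python
import bisect
import itertools
import random


def make_discrete_sampler(p):
    """Return a function mapping z in [0, 1] to an outcome in {1, ..., k}."""
    f = list(itertools.accumulate(p))      # f_i = f_{i-1} + p_i, the steps of the CDF

    def sample(z):
        # Smallest 0-based position i with f[i] >= z.
        i = bisect.bisect_left(f, z)
        return min(i, len(f) - 1) + 1      # guard against rounding in the last step

    return sample


sample = make_discrete_sampler([0.1, 0.2, 0.4, 0.2, 0.1])   # hypothetical p
rng = random.Random(0)
draws = [sample(rng.random()) for _ in range(10000)]
```

The binary search costs $O(\log k)$ per draw, which matters when the number of
outcomes $k$ is large.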

[Figure 2.14 panels: Exponential Distribution; Cumulative Distribution Function]

Fig. 2.14. Left: Exponential distribution with $\lambda = 1$. Right:
associated cumulative distribution function. When sampling, we draw $x$
uniformly at random from $U[0,1]$ and compute the inverse.


Example 2.9 (Exponential Distribution) The density of an
Exponential-distributed random variable is given by

$$p(x|\lambda) = \lambda \exp(-\lambda x) \quad \text{if } \lambda > 0 \text{ and } x \geq 0. \tag{2.78}$$

This allows us to compute its cdf as

$$F(x|\lambda) = 1 - \exp(-\lambda x) \quad \text{if } \lambda > 0 \text{ for } x \geq 0. \tag{2.79}$$

Therefore to generate an Exponential random variable we draw $z \sim U[0,1]$
and solve $x = \phi(z) = F^{-1}(z|\lambda) = -\lambda^{-1} \log(1 - z)$. Since
$z$ and $1 - z$ are both drawn from $U[0,1]$ we can simplify this to
$x = -\lambda^{-1} \log z$.
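A short sketch of this sampler is given below. It relies on Python's
`random.Random.random()`, which returns values in $[0, 1)$, so we use
$1 - z$ as the uniform variable to avoid evaluating $\log 0$. The rate and the
sample size are arbitrary choices.

```python
import math
import random


def sample_exponential(lam, rng):
    """Draw from the exponential distribution with rate lam by inverting its CDF."""
    z = 1.0 - rng.random()          # rng.random() lies in [0, 1), so z lies in (0, 1]
    return -math.log(z) / lam


rng = random.Random(0)
lam = 2.0
xs = [sample_exponential(lam, rng) for _ in range(50000)]
mean = sum(xs) / len(xs)            # should be close to 1 / lam
```

The empirical mean of the draws approaches $1/\lambda$, the mean of the
exponential distribution.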


We could apply the same reasoning to the normal distribution in order to
draw Gaussian random variables. Unfortunately, the cumulative distribution
function of the Gaussian is not available in closed form and we would need
to resort to rather nontrivial numerical techniques. It turns out that there
exists a much more elegant algorithm which has its roots in Gauss' proof of
the normalization constant of the Normal distribution. This technique is known
as the Box-Muller transform.


Example 2.10 (Box-Muller Transform) Denote by $X, Y$ independent Gaussian
random variables with zero mean and unit variance. We have

$$p(x, y) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} y^2} = \frac{1}{2\pi} e^{-\frac{1}{2}(x^2 + y^2)} \tag{2.80}$$


Fig. 2.15. Red: true density of the standard normal distribution (red line) is
contrasted with the histogram of 20,000 random variables generated by the
Box-Muller transform.


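The key observation is that the joint distribution $p(x, y)$ is radially
symmetric, i.e. it only depends on the radius $r^2 = x^2 + y^2$. Hence we may
perform a variable substitution in polar coordinates via the map $\phi$ where

$$x = r \cos\theta \text{ and } y = r \sin\theta \text{ hence } (x, y) = \phi^{-1}(r, \theta). \tag{2.81}$$

This allows us to express the density in terms of $(r, \theta)$ via

$$p(r, \theta) = p(\phi^{-1}(r, \theta)) \left|\nabla_{r,\theta} \phi^{-1}(r, \theta)\right| = \frac{1}{2\pi} e^{-\frac{1}{2} r^2} \left|\det \begin{bmatrix} \cos\theta & \sin\theta \\ -r \sin\theta & r \cos\theta \end{bmatrix}\right| = \frac{r}{2\pi} e^{-\frac{1}{2} r^2}.$$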


The fact that $p(r, \theta)$ is constant in $\theta$ means that we can easily
sample $\theta \in [0, 2\pi]$ by drawing a random variable, say $z_\theta$,
from $U[0,1]$ and rescaling it with $2\pi$. To obtain a sampler for $r$ we
need to compute the cumulative distribution function for
$p(r) = r e^{-\frac{1}{2} r^2}$:

$$F(r') = \int_0^{r'} r e^{-\frac{1}{2} r^2} \, dr = 1 - e^{-\frac{1}{2} r'^2} \quad \text{and hence} \quad r = F^{-1}(z) = \sqrt{-2 \log(1 - z)}. \tag{2.82}$$

Observing that $z \sim U[0,1]$ implies that $1 - z \sim U[0,1]$ yields the
following sampler: draw $z_\theta, z_r \sim U[0,1]$ and compute $x$ and $y$ by

$$x = \sqrt{-2 \log z_r} \cos 2\pi z_\theta \quad \text{and} \quad y = \sqrt{-2 \log z_r} \sin 2\pi z_\theta.$$
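The sampler can be sketched as follows. As before we use one minus the uniform
draw for the radius so that the logarithm is never evaluated at zero; the seed
and the number of draws are arbitrary choices of this illustration.

```python
import math
import random


def box_muller(rng):
    """Return two independent standard normal draws computed from two uniform draws."""
    z_r = 1.0 - rng.random()                  # lies in (0, 1], avoids log(0)
    z_theta = rng.random()
    r = math.sqrt(-2.0 * math.log(z_r))       # radius via the inverse CDF (2.82)
    angle = 2.0 * math.pi * z_theta           # uniformly distributed angle
    return r * math.cos(angle), r * math.sin(angle)


rng = random.Random(0)
samples = [v for _ in range(10000) for v in box_muller(rng)]   # 20,000 values
mean = sum(samples) / len(samples)
var = sum((v - mean) ** 2 for v in samples) / len(samples)
```

The empirical mean and variance of the draws should be close to 0 and 1
respectively.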


Note that the Box-Miiller transform yields two independent Gaussian ran- 
dom variables. See Figure 2.15 for an example of the sampler. 
Example 2.11 (Uniform distribution on the disc) A similar strategy can be employed when sampling from the unit disc. In this case the closed-form expression of the distribution is simply given by

$$p(x, y) = \begin{cases} \frac{1}{\pi} & \text{if } x^2 + y^2 \leq 1 \\ 0 & \text{otherwise.} \end{cases} \tag{2.83}$$

Using the variable transform (2.81) yields

$$p(r, \theta) = p(\phi^{-1}(r, \theta)) \left|\nabla_{r,\theta}\, \phi^{-1}(r, \theta)\right| = \begin{cases} \frac{r}{\pi} & \text{if } r \leq 1 \\ 0 & \text{otherwise.} \end{cases} \tag{2.84}$$

Integrating out $\theta$ yields $p(r) = 2r$ for $r \in [0, 1]$ with corresponding CDF $F(r) = r^2$ for $r \in [0, 1]$. Hence our sampler draws $z_r, z_\theta \sim U[0, 1]$ and then computes $x = \sqrt{z_r} \cos 2\pi z_\theta$ and $y = \sqrt{z_r} \sin 2\pi z_\theta$.
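
As a rough sketch of Example 2.11, the following Python function draws one point from the unit disc by inverting the CDF $F(r) = r^2$. The function name `sample_disc` is our own, and `rng` is assumed to be a `random.Random` instance.

```python
import math
import random


def sample_disc(rng):
    """Draw one point uniformly from the unit disc via r = sqrt(z_r)."""
    z_r, z_theta = rng.random(), rng.random()
    r = math.sqrt(z_r)                   # inverse of the CDF F(r) = r^2
    theta = 2.0 * math.pi * z_theta      # angle, uniform on [0, 2 pi)
    return r * math.cos(theta), r * math.sin(theta)
```

Note the square root: drawing $r$ itself uniformly from $[0, 1]$ would place too many points near the centre of the disc.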


2.5.2 Rejection Sampler 


All the methods for random variable generation that we looked at so far re- 
quire intimate knowledge about the pdf of the distribution. We now describe 
a general purpose method, which can be used to generate samples from an 
arbitrary distribution. Let us begin with sampling from a set: 


Example 2.12 (Rejection Sampler) Denote by $X \subseteq \mathcal{X}$ a set and let $p$ be a density on $\mathcal{X}$. Then a sampler for drawing from $p_X(x) \propto p(x)$ for $x \in X$ and $p_X(x) = 0$ for $x \notin X$, that is, $p_X(x) = p(x \mid x \in X)$, is obtained by the procedure:

repeat
    draw x ~ p(x)
until x ∈ X
return x


That is, the algorithm keeps on drawing from p until the random variable is 
contained in X. The probability that this occurs is clearly p(X). Hence the 
larger p(X) the higher the efficiency of the sampler. See Figure 2.16. 


Example 2.13 (Uniform distribution on a disc) The procedure works trivially as follows: draw $x, y \sim U[0, 1]$. Accept if $(2x - 1)^2 + (2y - 1)^2 \leq 1$ and return sample $(2x - 1, 2y - 1)$. This sampler has efficiency $\frac{\pi}{4}$ since this is the ratio between the area of the unit disc and the area of the enclosing square $[-1, 1]^2$.

Note that this time we did not need to carry out any sophisticated measure transform. This mathematical convenience came at the expense of a slightly less efficient sampler: about 21% of all samples are rejected.


Fig. 2.16. Rejection sampler. Left: samples drawn from the uniform distribution on $[0, 1]^2$. Middle: the samples drawn from the uniform distribution on the unit disc are all the points in the grey shaded area. Right: the same procedure allows us to sample uniformly from arbitrary sets.


Fig. 2.17. Accept reject sampling for the Beta(2,5) distribution. Left: Samples are generated uniformly from the blue rectangle (shaded area). Only those samples which fall under the red curve of the Beta(2,5) distribution (darkly shaded area) are accepted. Right: The true density of the Beta(2,5) distribution (red line) is contrasted with the histogram of 10,000 samples drawn by the rejection sampler.
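
The accept/reject procedure of Example 2.13 might be sketched as follows. The function name `rejection_sample_disc` is hypothetical; it also returns the number of proposals that were needed, so that the efficiency can be estimated empirically.

```python
import random


def rejection_sample_disc(rng):
    """Return (point, proposals_used) for one uniform draw from the unit disc."""
    tries = 0
    while True:
        tries += 1
        # Propose uniformly on the square [-1, 1]^2.
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        if x * x + y * y <= 1.0:         # accept only points inside the disc
            return (x, y), tries
```

Averaged over many draws, the fraction of accepted proposals should be close to $\pi/4 \approx 0.785$, which matches the rejection rate of roughly 21%.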


The same reasoning that we used to obtain a hard accept/reject procedure 
can be used for a considerably more sophisticated rejection sampler. The 
basic idea is that if, for a given distribution p we can find another distribution 
q which, after rescaling, becomes an upper envelope on p, we can use q to 
sample from and reject depending on the ratio between q and p.


Theorem 2.23 (Rejection Sampler) Denote by $p$ and $q$ distributions on $\mathcal{X}$ and let $c$ be a constant such that $c\,q(x) \geq p(x)$ for all $x \in \mathcal{X}$. Then the algorithm below draws from $p$ with acceptance probability $c^{-1}$.


repeat
    draw x ~ q(x) and t ~ U[0, 1]
until ct ≤ p(x)/q(x)
return x


Proof Denote by $Z$ the event that the sample drawn from $q$ is accepted. Then by Bayes rule the probability $\Pr(x|Z)$ can be written as follows

$$\Pr(x|Z) = \frac{\Pr(Z|x)\Pr(x)}{\Pr(Z)} = \frac{\frac{p(x)}{c\,q(x)} \cdot q(x)}{c^{-1}} = p(x). \tag{2.85}$$

Here we used that $\Pr(Z) = \int \Pr(Z|x)\, q(x)\, dx = \int c^{-1} p(x)\, dx = c^{-1}$. ∎


Note that the algorithm of Example 2.12 is a special case of such a rejection sampler: we majorize $p_X$ by the uniform distribution rescaled by $\frac{1}{p(X)}$.


Example 2.14 (Beta distribution) Recall that the Beta(a, b) distribution, as a member of the Exponential Family with sufficient statistics $(\log x, \log(1 - x))$, is given by

$$p(x|a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, x^{a-1} (1 - x)^{b-1}. \tag{2.86}$$

For given $(a, b)$ one can verify (Problem 2.25) that

$$M := \operatorname*{argmax}_x\, p(x|a, b) = \frac{a - 1}{a + b - 2} \tag{2.87}$$

provided $a > 1$. Hence, if we use as proposal distribution the uniform distribution $U[0, 1]$ with scaling factor $c = p(M|a, b)$ we may apply Theorem 2.23. As illustrated in Figure 2.17, to generate a sample from Beta(a, b) we first generate a pair $(x, t)$, uniformly at random from the shaded rectangle. A sample is retained if $ct \leq p(x|a, b)$, and rejected otherwise. The acceptance rate of this sampler is $\frac{1}{c}$.
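
A minimal sketch of Example 2.14 in Python follows. The helper names `beta_pdf` and `sample_beta` are our own; the sketch assumes $a > 1$ and $b \geq 1$ so that the mode (2.87) is well defined, and it evaluates the normalization with `math.gamma`.

```python
import math
import random


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in [0, 1], cf. (2.86)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1.0 - x) ** (b - 1)


def sample_beta(a, b, rng):
    """Rejection sampler with a U[0, 1] proposal; returns (sample, proposals_used)."""
    mode = (a - 1.0) / (a + b - 2.0)     # (2.87)
    c = beta_pdf(mode, a, b)             # envelope height, since q(x) = 1 on [0, 1]
    tries = 0
    while True:
        tries += 1
        x, t = rng.random(), rng.random()
        if c * t <= beta_pdf(x, a, b):   # acceptance test of Theorem 2.23
            return x, tries
```

For Beta(2, 5) the mode is $M = 0.2$ and $c = p(M|2, 5) = 2.4576$, so roughly $1/c \approx 41\%$ of the proposals are accepted.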


Example 2.15 (Normal distribution) We may use the Laplace distribution to generate samples from the Normal distribution. That is, we use

$$q(x|\lambda) = \frac{\lambda}{2} e^{-\lambda |x|} \tag{2.88}$$

as the proposal distribution. For a normal distribution $p = N(0, 1)$ with zero mean and unit variance it turns out that choosing $\lambda = 1$ yields the most efficient sampling scheme (see Problem 2.27) with

$$p(x) \leq \sqrt{\frac{2e}{\pi}}\, q(x|\lambda = 1).$$

As illustrated in Figure 2.18, we first generate $x \sim q(x|\lambda = 1)$ using the inverse transform method (see Example 2.9 and Problem 2.21) and $t \sim U[0, 1]$. If $\sqrt{2e/\pi}\; t \leq p(x)/q(x|\lambda = 1)$ we accept $x$, otherwise we reject it, in accordance with Theorem 2.23 for $c = \sqrt{2e/\pi}$. The efficiency of this scheme is $\frac{1}{c} = \sqrt{\frac{\pi}{2e}}$.
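
The scheme of Example 2.15 could be sketched as follows. The function names are hypothetical. As a simplifying assumption, the Laplace proposal is generated by drawing an exponential magnitude via the inverse transform and attaching a random sign, which yields the same distribution as inverting the Laplace CDF directly.

```python
import math
import random


def sample_laplace(rng):
    """Draw from q(x) = 0.5 * exp(-|x|): Exp(1) magnitude with a random sign."""
    magnitude = -math.log(1.0 - rng.random())    # inverse transform for Exp(1)
    return magnitude if rng.random() < 0.5 else -magnitude


def sample_normal_via_laplace(rng):
    """Rejection sampler for N(0, 1); returns (sample, proposals_used)."""
    c = math.sqrt(2.0 * math.e / math.pi)        # envelope constant for lambda = 1
    tries = 0
    while True:
        tries += 1
        x = sample_laplace(rng)
        t = rng.random()
        p = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        q = 0.5 * math.exp(-abs(x))
        if c * t <= p / q:                       # acceptance test of Theorem 2.23
            return x, tries
```

The acceptance probability for a given proposal simplifies to $\exp(-\frac{1}{2}(|x| - 1)^2)$, and on average about $\sqrt{\pi/(2e)} \approx 76\%$ of the proposals are kept.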


Fig. 2.18. Rejection sampling for the Normal distribution (red curve). Samples are generated uniformly from the Laplace distribution rescaled by $\sqrt{2e/\pi}$. Only those samples which fall under the red curve of the standard normal distribution (darkly shaded area) are accepted.


While rejection sampling is fairly efficient in low dimensions its efficiency is unsatisfactory in high dimensions. This leads us to an instance of the curse of dimensionality [Bel61]: the pdf of a d-dimensional Gaussian random variable centered at 0 with variance $\sigma^2 \mathbf{1}$ is given by

$$p(x|\sigma^2) = (2\pi)^{-\frac{d}{2}} \sigma^{-d} e^{-\frac{1}{2\sigma^2} \|x\|^2}.$$

Now suppose that we want to draw from $p(x|\sigma^2)$ by sampling from another Gaussian $q$ with slightly larger variance $\rho^2 > \sigma^2$. In this case the ratio between both distributions is maximized at 0 and it yields

$$c = \frac{p(0|\sigma^2)}{q(0|\rho^2)} = \left[\frac{\rho}{\sigma}\right]^d.$$

If we suppose $\frac{\rho}{\sigma} = 1.01$ and $d = 1000$, we find that $c \approx 20960$. In other words, we need to generate approximately 21,000 samples on average from $q$ to draw a single sample from $p$. We will discuss a more sophisticated sampling algorithm, namely Gibbs Sampling, in Section ??. It allows us to draw from rather nontrivial distributions as long as the distributions in small subsets of random variables are simple enough to be tackled directly.
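
To get a feeling for how quickly the envelope constant $c = (\rho/\sigma)^d$ grows, one can simply tabulate it. The short sketch below uses a hypothetical helper `envelope_constant`; the dimensions in the loop are arbitrary illustrative choices.

```python
def envelope_constant(ratio, d):
    """c = (rho / sigma)^d, the expected number of proposals per accepted sample."""
    return ratio ** d


if __name__ == "__main__":
    # Even a 1% mismatch in the standard deviation becomes costly as d grows.
    for d in (1, 10, 100, 1000):
        print(d, envelope_constant(1.01, d))
```

For $d = 1000$ the value lies just below 21,000, in line with the figure quoted above.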


Problems 


Problem 2.1 (Bias Variance Decomposition {1}) Prove that the variance $\mathrm{Var}_X[x]$ of a random variable can be written as $\mathbf{E}_X[x^2] - \mathbf{E}_X[x]^2$.


Problem 2.2 (Moment Generating Function {2}) Prove that the char- 
acteristic function can be used to generate moments as given in (2.12). Hint: 
use the Taylor expansion of the exponential and apply the differential oper- 
ator before the expectation. 


Problem 2.3 (Cumulative Error Function {2}) 
arte) = van f eda. (2.89) 
0 


Problem 2.4 (Weak Law of Large Numbers {2}) In analogy to the proof of the central limit theorem prove the weak law of large numbers. Hint: use a first order Taylor expansion of $e^{i\omega t} = 1 + i\omega t + o(t)$ to compute an approximation of the characteristic function. Next compute the limit $m \to \infty$ for $\phi_{\bar{X}_m}$. Finally, apply the inverse Fourier transform to associate the constant distribution at the mean $\mu$ with it.


Problem 2.5 (Rates and confidence bounds {3}) Show that the rate of Hoeffding's bound is tight: derive a bound from the central limit theorem and compare it to the Hoeffding rate.


Problem 2.6 Why can't we just use each chip on the wafer as a random variable? Give a counterexample. Give bounds if we actually were allowed to do this.


Problem 2.7 (Union Bound) Work on many bounds at the same time. 
We only have logarithmic penalty. 


Problem 2.8 (Randomized Rounding {4}) Solve the linear system of 
equations Ax = b for integral x. 


Problem 2.9 (Randomized Projections {3}) Prove that the random- 
ized projections converge. 


Problem 2.10 (The Count-Min Sketch {5}) Prove the projection trick 


Problem 2.11 (Parzen windows with triangle kernels {1}) Suppose you are given the following data: $X = \{2, 3, 3, 5, 5\}$. Plot the estimated density using a kernel density estimator with the following kernel:

$$k(u) = \begin{cases} 0.5 - 0.25\,|u| & \text{if } |u| \leq 2 \\ 0 & \text{otherwise.} \end{cases}$$


Problem 2.12 Gaussian process link with Gaussian prior on natural pa- 
rameters 


Problem 2.13 Optimization for Gaussian regularization 
Problem 2.14 Conjugate prior (student-t and wishart). 


Problem 2.15 (Multivariate Gaussian {1}) Prove that $\Sigma \succ 0$ is a necessary and sufficient condition for the normal distribution to be well defined.


Problem 2.16 (Discrete Exponential Distribution {2}) $\phi(x) = x$ and uniform measure.


Problem 2.17 Exponential random graphs. 


Problem 2.18 (Maximum Entropy Distribution) Show that exponen- 
tial families arise as the solution of the maximum entropy estimation prob- 
lem. 


Problem 2.19 (Maximum Likelihood Estimates for Normal Distributions) Derive the maximum likelihood estimates for a normal distribution, that is, show that they result in

$$\hat{\mu} = \frac{1}{m} \sum_{i=1}^m x_i \text{ and } \hat{\sigma}^2 = \frac{1}{m} \sum_{i=1}^m (x_i - \hat{\mu})^2 \tag{2.90}$$

using the exponential families parametrization. Next show that while the mean estimate $\hat{\mu}$ is unbiased, the variance estimate has a slight bias of $O(\frac{1}{m})$. To see this, take the expectation of $\hat{\sigma}^2$.


Problem 2.20 (cdf of Logistic random variable {1}) Show that the cdf of the Logistic random variable (??) is given by (??).


Problem 2.21 (Double-exponential (Laplace) distribution {1}) Use 
the inverse-transform method to generate a sample from the double-exponential 
(Laplace) distribution (2.88). 


Problem 2.22 (Normal random variables in polar coordinates {1}) Let $X_1$ and $X_2$ be independent standard normal random variables and let $(R, \theta)$ denote the polar coordinates of the pair $(X_1, X_2)$. Show that $R^2 \sim \chi^2_2$ and $\theta \sim \mathrm{Unif}[0, 2\pi]$.


Problem 2.23 (Monotonically increasing mappings {1}) A mapping $T : \mathbb{R} \to \mathbb{R}$ is one-to-one if, and only if, $T$ is monotonically increasing, that is, $x > y$ implies that $T(x) > T(y)$.


Problem 2.24 (Monotonically increasing multi-maps {2}) Let $T : \mathbb{R}^n \to \mathbb{R}^n$ be one-to-one. If $X \sim p_X(x)$, then show that the distribution $p_Y(y)$ of $Y = T(X)$ can be obtained via (??).


Problem 2.25 (Argmax of the Beta(a,b) distribution {1}) Show that 
the mode of the Beta(a, b) distribution is given by (2.87). 


Problem 2.26 (Accept reject sampling for the unit disk {2}) Give at 
least TWO different accept-reject based sampling schemes to generate sam- 
ples uniformly at random from the unit disk. Compute their efficiency. 


Problem 2.27 (Optimizing Laplace for Standard Normal {1}) Optimize the ratio $p(x)/g(x|\mu, \sigma)$, with respect to $\mu$ and $\sigma$, where $p(x)$ is the standard normal distribution (??), and $g(x|\mu, \sigma)$ is the Laplace distribution (2.88).


Problem 2.28 (Normal Random Variable Generation {2}) The aim 
of this problem is to write code to generate standard normal random vari- 
ables (??) by using different methods. To do this generate U ~ Unif[0, 1] 
and apply 


(i) the Box-Muller transformation outlined in Section ??. 
(ii) use the following approximation to the inverse CDF

$$\Phi^{-1}(\alpha) \approx t - \frac{a_0 + a_1 t}{1 + b_1 t + b_2 t^2}, \tag{2.91}$$

where $t^2 = \log(\alpha^{-2})$ and
$a_0 = 2.30753$, $a_1 = 0.27061$, $b_1 = 0.99229$, $b_2 = 0.04481$
(iii) use the method outlined in Example 2.15.


Plot a histogram of the samples you generated to confirm that they are nor- 
mally distributed. Compare these different methods in terms of the time 
needed to generate 1000 random variables. 


Problem 2.29 (Non-standard Normal random variables {2}) Describe a scheme based on the Box-Muller transform to generate d dimensional normal random variables $p(x|0, I)$. How can this be used to generate arbitrary normal random variables $p(x|\mu, \Sigma)$?


Problem 2.30 (Uniform samples from a disk {2}) Show how the ideas described in Section ?? can be generalized to draw samples uniformly at random from an axis parallel ellipse: $\{(x, y) : \frac{x^2}{a^2} + \frac{y^2}{b^2} \leq 1\}$.


Optimization


Optimization plays an increasingly important role in machine learning. For 
instance, many machine learning algorithms minimize a regularized risk 
functional: 


$$\min_f J(f) = \lambda \Omega(f) + R_{\mathrm{emp}}(f), \tag{3.1}$$

with the empirical risk

$$R_{\mathrm{emp}}(f) = \frac{1}{m} \sum_{i=1}^m l(f(x_i), y_i). \tag{3.2}$$

Here $x_i$ are the training instances and $y_i$ are the corresponding labels. The loss function $l$ measures the discrepancy between $y_i$ and the predictions $f(x_i)$. Finding the optimal $f$ involves solving an optimization problem.

This chapter provides a self-contained overview of some basic concepts and 
tools from optimization, especially geared towards solving machine learning 
problems. In terms of concepts, we will cover topics related to convexity, 
duality, and Lagrange multipliers. In terms of tools, we will cover a variety 
of optimization algorithms including gradient descent, stochastic gradient 
descent, Newton’s method, and Quasi-Newton methods. We will also look 
at some specialized algorithms tailored towards solving Linear Programming 
and Quadratic Programming problems which often arise in machine learning 
problems. 


3.1 Preliminaries 


Minimizing an arbitrary function is, in general, very difficult, but if the objective function to be minimized is convex then things become considerably simpler. As we will see shortly, the key advantage of dealing with convex functions is that a local optimum is also a global optimum. Therefore, well developed tools exist to find the global minimum of a convex function. Consequently, many machine learning algorithms are now formulated in terms of convex optimization problems. We briefly review the concept of convex sets and functions in this section.


3.1.1 Convex Sets


Definition 3.1 (Convex Set) A subset $C$ of $\mathbb{R}^n$ is said to be convex if $(1 - \lambda)x + \lambda y \in C$ whenever $x \in C$, $y \in C$ and $0 < \lambda < 1$.


Intuitively, what this means is that the line segment joining any two points $x$ and $y$ from the set $C$ lies inside $C$ (see Figure 3.1). It is easy to see (Exercise 3.1) that intersections of convex sets are also convex.


Fig. 3.1. The convex set (left) contains the line joining any two points that belong 
to the set. A non-convex set (right) does not satisfy this property. 


A vector sum $\sum_i \lambda_i x_i$ is called a convex combination if $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$. Convex combinations are helpful in defining a convex hull:


Definition 3.2 (Convex Hull) The convex hull, $\mathrm{conv}(X)$, of a finite subset $X = \{x_1, \ldots, x_n\}$ of $\mathbb{R}^n$ consists of all convex combinations of $x_1, \ldots, x_n$.


3.1.2 Convex Functions 


Let $f$ be a real valued function defined on a set $X \subset \mathbb{R}^n$. The set

$$\{(x, \mu) : x \in X, \mu \in \mathbb{R}, \mu \geq f(x)\} \tag{3.3}$$

is called the epigraph of $f$. The function $f$ is defined to be a convex function if its epigraph is a convex set in $\mathbb{R}^{n+1}$. An equivalent, and more commonly used, definition (Exercise 3.5) is as follows (see Figure 3.2 for geometric intuition):


Definition 3.3 (Convex Function) A function $f$ defined on a set $X$ is called convex if, for any $x, x' \in X$ and any $0 < \lambda < 1$ such that $\lambda x + (1 - \lambda)x' \in X$, we have

$$f(\lambda x + (1 - \lambda)x') \leq \lambda f(x) + (1 - \lambda) f(x'). \tag{3.4}$$

A function $f$ is called strictly convex if

$$f(\lambda x + (1 - \lambda)x') < \lambda f(x) + (1 - \lambda) f(x') \tag{3.5}$$

whenever $x \neq x'$.


In fact, the above definition can be extended to show that if $f$ is a convex function and $\lambda_i \geq 0$ with $\sum_i \lambda_i = 1$ then

$$f\left(\sum_i \lambda_i x_i\right) \leq \sum_i \lambda_i f(x_i). \tag{3.6}$$

The above inequality is called Jensen's inequality (problem ).


Fig. 3.2. A convex function (left) satisfies (3.4); the shaded region denotes its epi- 
graph. A nonconvex function (right) does not satisfy (3.4). 
If $f : X \to \mathbb{R}$ is differentiable, then $f$ is convex if, and only if,

$$f(x') \geq f(x) + \langle x' - x, \nabla f(x)\rangle \text{ for all } x, x' \in X. \tag{3.7}$$

In other words, the first order Taylor approximation lower bounds the convex function universally (see Figure 3.4). Here and in the rest of the chapter $\langle x, y\rangle$ denotes the Euclidean dot product between vectors $x$ and $y$, that is,

$$\langle x, y\rangle = \sum_i x_i y_i. \tag{3.8}$$


If $f$ is twice differentiable, then $f$ is convex if, and only if, its Hessian is positive semi-definite, that is,

$$\nabla^2 f(x) \succeq 0. \tag{3.9}$$

For twice differentiable strictly convex functions, the Hessian matrix is positive definite, that is, $\nabla^2 f(x) \succ 0$. We briefly summarize some operations which preserve convexity:

Addition: If $f_1$ and $f_2$ are convex, then $f_1 + f_2$ is also convex.
Scaling: If $f$ is convex, then $\alpha f$ is convex for $\alpha > 0$.
Affine Transform: If $f$ is convex, then $g(x) = f(Ax + b)$ for some matrix $A$ and vector $b$ is also convex.
Adding a Linear Function: If $f$ is convex, then $g(x) = f(x) + \langle a, x\rangle$ for some vector $a$ is also convex.
Subtracting a Linear Function: If $f$ is convex, then $g(x) = f(x) - \langle a, x\rangle$ for some vector $a$ is also convex.
Pointwise Maximum: If $f_i$ are convex, then $g(x) = \max_i f_i(x)$ is also convex.
Scalar Composition: If $f(x) = h(g(x))$, then $f$ is convex if a) $g$ is convex, and $h$ is convex, non-decreasing or b) $g$ is concave, and $h$ is convex, non-increasing.


Fig. 3.3. Left: Convex function in two variables. Right: the corresponding convex below-sets $\{x \mid f(x) \leq c\}$, for different values of $c$. This is also called a contour plot.


There is an intimate relation between convex functions and convex sets. For instance, the following lemma shows that the below sets (level sets) of convex functions, that is, the sets for which $f(x) \leq c$, are convex.


Lemma 3.4 (Below-Sets of Convex Functions) Denote by $f : X \to \mathbb{R}$ a convex function. Then the set

$$X_c := \{x \mid x \in X \text{ and } f(x) \leq c\}, \text{ for all } c \in \mathbb{R}, \tag{3.10}$$

is convex.


Proof For any $x, x' \in X_c$, we have $f(x), f(x') \leq c$. Moreover, since $f$ is convex, we also have

$$f(\lambda x + (1 - \lambda)x') \leq \lambda f(x) + (1 - \lambda) f(x') \leq c \text{ for all } 0 < \lambda < 1. \tag{3.11}$$

Hence, for all $0 < \lambda < 1$, we have $(\lambda x + (1 - \lambda)x') \in X_c$, which proves the claim. Figure 3.3 depicts this situation graphically. ∎

As we hinted in the introduction of this chapter, minimizing an arbitrary function on a (possibly not even compact) set of arguments can be a difficult task, and will most likely exhibit many local minima. In contrast, minimization of a convex objective function on a convex set exhibits exactly one global minimum. We now prove this property.


Theorem 3.5 (Minima on Convex Sets) If the convex function $f : X \to \mathbb{R}$ attains its minimum, then the set of $x \in X$, for which the minimum value is attained, is a convex set. Moreover, if $f$ is strictly convex, then this set contains a single element.


Proof Denote by $c$ the minimum of $f$ on $X$. Then the set $X_c := \{x \mid x \in X \text{ and } f(x) \leq c\}$ is convex by Lemma 3.4.

If $f$ is strictly convex, then for any two distinct $x, x' \in X_c$ and any $0 < \lambda < 1$ we have

$$f(\lambda x + (1 - \lambda)x') < \lambda f(x) + (1 - \lambda) f(x') = \lambda c + (1 - \lambda)c = c,$$

which contradicts the assumption that $f$ attains its minimum on $X_c$. Therefore $X_c$ must contain only a single element. ∎


As the following lemma shows, the minimum point can be characterized 
precisely. 


Lemma 3.6 Let $f : X \to \mathbb{R}$ be a differentiable convex function. Then $x$ is a minimizer of $f$, if, and only if,

$$\langle x' - x, \nabla f(x)\rangle \geq 0 \text{ for all } x'. \tag{3.12}$$


Proof To show the forward implication, suppose that $x$ is the optimum but (3.12) does not hold, that is, there exists an $x'$ for which

$$\langle x' - x, \nabla f(x)\rangle < 0.$$

Consider the line segment $z(\lambda) = (1 - \lambda)x + \lambda x'$, with $0 < \lambda < 1$. Since $X$ is convex, $z(\lambda)$ lies in $X$. On the other hand,

$$\left.\frac{d}{d\lambda} f(z(\lambda))\right|_{\lambda = 0} = \langle x' - x, \nabla f(x)\rangle < 0,$$

which shows that for small values of $\lambda$ we have $f(z(\lambda)) < f(x)$, thus showing that $x$ is not optimal.
The reverse implication follows from (3.7) by noting that $f(x') \geq f(x)$, whenever (3.12) holds. ∎

One way to ensure that (3.12) holds is to set $\nabla f(x) = 0$. In other words, minimizing a convex function is equivalent to finding an $x$ such that $\nabla f(x) = 0$. Therefore, the first order conditions are both necessary and sufficient when minimizing a convex function.


3.1.3 Subgradients 


So far, we worked with differentiable convex functions. The subgradient is a 
generalization of gradients appropriate for convex functions, including those 
which are not necessarily smooth. 


Definition 3.7 (Subgradient) Suppose $x$ is a point where a convex function $f$ is finite. Then a subgradient is the normal vector of any tangential supporting hyperplane of $f$ at $x$. Formally $\mu$ is called a subgradient of $f$ at $x$ if, and only if,

$$f(x') \geq f(x) + \langle x' - x, \mu\rangle \text{ for all } x'. \tag{3.13}$$

The set of all subgradients at a point is called the subdifferential, and is denoted by $\partial_x f(x)$. If this set is not empty then $f$ is said to be subdifferentiable at $x$. On the other hand, if this set is a singleton, then the function is said to be differentiable at $x$. In this case we use $\nabla f(x)$ to denote the gradient of $f$. Convex functions are subdifferentiable everywhere in their domain. We now state some simple rules of subgradient calculus:


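Addition: $\partial_x (f_1(x) + f_2(x)) = \partial_x f_1(x) + \partial_x f_2(x)$
Scaling: $\partial_x \alpha f(x) = \alpha\, \partial_x f(x)$, for $\alpha > 0$
Affine Transform: If $g(x) = f(Ax + b)$ for some matrix $A$ and vector $b$, then $\partial_x g(x) = A^\top \partial_y f(y)$, where $y = Ax + b$.
Pointwise Maximum: If $g(x) = \max_i f_i(x)$ then $\partial g(x) = \mathrm{conv}(\partial_x f_{i'})$ where $i' \in \operatorname{argmax}_i f_i(x)$.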


The definition of a subgradient can also be understood geometrically. As 
illustrated by Figure 3.4, a differentiable convex function is always lower 
bounded by its first order Taylor approximation. This concept can be ex- 
tended to non-smooth functions via subgradients, as Figure 3.5 shows. 

By using more involved concepts, the proof of Lemma 3.6 can be extended to subgradients. In this case, minimizing a convex nonsmooth function entails finding an $x$ such that $0 \in \partial f(x)$.
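
The defining inequality (3.13) is easy to probe numerically for one dimensional functions. The sketch below uses a hypothetical helper `is_subgradient`, which only checks the inequality on a finite set of test points; passing the check is therefore necessary but not sufficient for $\mu$ to be a subgradient.

```python
def is_subgradient(f, x, mu, test_points, tol=1e-12):
    """Check f(x') >= f(x) + (x' - x) * mu for every x' in test_points."""
    return all(f(xp) >= f(x) + (xp - x) * mu - tol for xp in test_points)


# Grid of test points on [-5, 5] used in the usage note below.
grid = [i / 10.0 for i in range(-50, 51)]
```

For $f(x) = |x|$, every $\mu \in [-1, 1]$ passes the check at the kink $x = 0$, for instance `is_subgradient(abs, 0.0, 0.5, grid)`, whereas $\mu = 1.5$ fails. Since $0$ is among the subgradients at $x = 0$, this point is the minimizer, in agreement with the optimality condition $0 \in \partial f(x)$.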
3.1.4 Strongly Convex Functions


When analyzing optimization algorithms, it is sometimes easier to work with strongly convex functions, which strengthen the definition of convexity.


Definition 3.8 (Strongly Convex Function) A convex function $f$ is $\sigma$-strongly convex if, and only if, there exists a constant $\sigma > 0$ such that the function $f(x) - \frac{\sigma}{2} \|x\|^2$ is convex.


The constant $\sigma$ is called the modulus of strong convexity of $f$. If $f$ is twice differentiable, then there is an equivalent, and perhaps easier, definition of strong convexity: $f$ is strongly convex if there exists a $\sigma > 0$ such that

$$\nabla^2 f(x) \succeq \sigma I. \tag{3.14}$$


In other words, the smallest eigenvalue of the Hessian of f is uniformly 
lower bounded by o everywhere. Some important examples of strongly con- 
vex functions include: 


Example 3.1 (Squared Euclidean Norm) The function $f(x) = \frac{\lambda}{2} \|x\|^2$ is $\lambda$-strongly convex.


Example 3.2 (Negative Entropy) Let $\Delta^n = \{x \text{ s.t. } \sum_{i=1}^n x_i = 1 \text{ and } x_i \geq 0\}$ be the n dimensional simplex, and $f : \Delta^n \to \mathbb{R}$ be the negative entropy:

$$f(x) = \sum_i x_i \log x_i. \tag{3.15}$$

Then $f$ is 1-strongly convex with respect to the $\|\cdot\|_1$ norm on the simplex (see Problem 3.7).


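If $f$ is a $\sigma$-strongly convex function then one can show the following properties (Exercise 3.8). Here $x, x'$ are arbitrary and $\mu \in \partial f(x)$ and $\mu' \in \partial f(x')$.

$$f(x') \geq f(x) + \langle x' - x, \mu\rangle + \frac{\sigma}{2} \|x' - x\|^2 \tag{3.16}$$
$$f(x') \leq f(x) + \langle x' - x, \mu\rangle + \frac{1}{2\sigma} \|\mu' - \mu\|^2 \tag{3.17}$$
$$\langle x - x', \mu - \mu'\rangle \geq \sigma \|x - x'\|^2 \tag{3.18}$$
$$\langle x - x', \mu - \mu'\rangle \leq \frac{1}{\sigma} \|\mu - \mu'\|^2. \tag{3.19}$$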
3.1.5 Convex Functions with Lipschitz Continuous Gradient

A somewhat symmetric concept to strong convexity is the Lipschitz conti- 
nuity of the gradient. As we will see later they are connected by Fenchel 
duality. 


Definition 3.9 (Lipschitz Continuous Gradient) A differentiable convex function $f$ is said to have a Lipschitz continuous gradient, if there exists a constant $L > 0$, such that

$$\|\nabla f(x) - \nabla f(x')\| \leq L \|x - x'\| \quad \forall x, x'. \tag{3.20}$$

As before, if $f$ is twice differentiable, then there is an equivalent, and perhaps easier, definition of Lipschitz continuity of the gradient: $f$ has a Lipschitz continuous gradient if there exists an $L$ such that

$$L I \succeq \nabla^2 f(x). \tag{3.21}$$

In other words, the largest eigenvalue of the Hessian of $f$ is uniformly upper bounded by $L$ everywhere. If $f$ has a Lipschitz continuous gradient with modulus $L$, then one can show the following properties (Exercise 3.9).

$$f(x') \leq f(x) + \langle x' - x, \nabla f(x)\rangle + \frac{L}{2} \|x - x'\|^2 \tag{3.22}$$
$$f(x') \geq f(x) + \langle x' - x, \nabla f(x)\rangle + \frac{1}{2L} \|\nabla f(x) - \nabla f(x')\|^2 \tag{3.23}$$
$$\langle x - x', \nabla f(x) - \nabla f(x')\rangle \leq L \|x - x'\|^2 \tag{3.24}$$
$$\langle x - x', \nabla f(x) - \nabla f(x')\rangle \geq \frac{1}{L} \|\nabla f(x) - \nabla f(x')\|^2. \tag{3.25}$$


3.1.6 Fenchel Duality 


The Fenchel conjugate of a function $f$ is given by

$$f^*(x^*) = \sup_x \{\langle x, x^*\rangle - f(x)\}. \tag{3.26}$$

Even if $f$ is not convex, the Fenchel conjugate, which is written as a supremum over linear functions, is always convex. Some rules for computing Fenchel duals are summarized in Table 3.1. If $f$ is convex and its epigraph (3.3) is a closed convex set, then $f^{**} = f$. If $f$ and $f^*$ are convex, then they satisfy the so-called Fenchel-Young inequality

$$f(x) + f^*(x^*) \geq \langle x, x^*\rangle \text{ for all } x, x^*. \tag{3.27}$$




Fig. 3.4. A convex function is always lower bounded by its first order Taylor ap- 
proximation. This is true even if the function is not differentiable (see Figure 3.5) 


Fig. 3.5. Geometric intuition of a subgradient. The nonsmooth convex function 
(solid blue) is only subdifferentiable at the “kink” points. We illustrate two of its 
subgradients (dashed green and red lines) at a “kink” point which are tangential to 
the function. The normal vectors to these lines are subgradients. Observe that the 
first order Taylor approximations obtained by using the subgradients lower bounds 
the convex function. 


This inequality becomes an equality whenever $x^* \in \partial f(x)$, that is,

$$f(x) + f^*(x^*) = \langle x, x^*\rangle \text{ for all } x \text{ and } x^* \in \partial f(x). \tag{3.28}$$


Table 3.1. Rules for computing Fenchel Duals
Scalar Addition: If $g(x) = f(x) + \alpha$ then $g^*(x^*) = f^*(x^*) - \alpha$.
Function Scaling: If $\alpha > 0$ and $g(x) = \alpha f(x)$ then $g^*(x^*) = \alpha f^*(x^*/\alpha)$.
Parameter Scaling: If $\alpha \neq 0$ and $g(x) = f(\alpha x)$ then $g^*(x^*) = f^*(x^*/\alpha)$.
Linear Transformation: If $A$ is an invertible matrix then $(f \circ A)^* = f^* \circ (A^{-1})^*$.
Shift: If $g(x) = f(x - x_0)$ then $g^*(x^*) = f^*(x^*) + \langle x^*, x_0\rangle$.
Sum: If $g(x) = f_1(x) + f_2(x)$ then $g^*(x^*) = \inf \{f_1^*(x_1^*) + f_2^*(x_2^*) \text{ s.t. } x_1^* + x_2^* = x^*\}$.
Pointwise Infimum: If $g(x) = \inf_i f_i(x)$ then $g^*(x^*) = \sup_i f_i^*(x^*)$.


Strong convexity (Section 3.1.4) and Lipschitz continuity of the gradient (Section 3.1.5) are related by Fenchel duality according to the following lemma, which we state without proof.


Lemma 3.10 (Theorem 4.2.1 and 4.2.2 [HUL93]) 


(i) If $f$ is $\sigma$-strongly convex, then $f^*$ has a Lipschitz continuous gradient with modulus $\frac{1}{\sigma}$.

(ii) If $f$ is convex and has a Lipschitz continuous gradient with modulus $L$, then $f^*$ is $\frac{1}{L}$-strongly convex.


Next we describe some convex functions and their Fenchel conjugates. 
Example 3.3 (Squared Euclidean Norm) Whenever $f(x) = \frac{1}{2} \|x\|^2$ we have $f^*(x^*) = \frac{1}{2} \|x^*\|^2$, that is, the squared Euclidean norm is its own conjugate.


Example 3.4 (Negative Entropy) The Fenchel conjugate of the negative entropy (3.15) is

$$f^*(x^*) = \log \sum_i \exp(x_i^*).$$
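
Example 3.4 can be checked numerically. The sketch below is an illustration under our own assumptions: the helper names are hypothetical, and we rely on the standard fact that the supremum in (3.26) over the simplex is attained at the softmax of $x^*$, where the Fenchel-Young inequality (3.27) becomes an equality.

```python
import math


def negative_entropy(x):
    """f(x) = sum_i x_i log x_i on the simplex, with 0 log 0 taken as 0."""
    return sum(xi * math.log(xi) for xi in x if xi > 0.0)


def log_sum_exp(s):
    """f*(s) = log sum_i exp(s_i), computed in a numerically stable way."""
    m = max(s)
    return m + math.log(sum(math.exp(si - m) for si in s))


def softmax(s):
    """Maximizer of <x, s> - f(x) over the simplex."""
    m = max(s)
    w = [math.exp(si - m) for si in s]
    total = sum(w)
    return [wi / total for wi in w]


def fenchel_gap(x, s):
    """f(x) + f*(s) - <x, s>, which is non-negative by (3.27)."""
    inner = sum(xi * si for xi, si in zip(x, s))
    return negative_entropy(x) + log_sum_exp(s) - inner
```

For any `s`, `fenchel_gap(softmax(s), s)` should vanish up to rounding, while any other point of the simplex gives a strictly positive gap.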


3.1.7 Bregman Divergence 
Let $f$ be a differentiable convex function. The Bregman divergence defined by $f$ is given by

$$\Delta_f(x, x') = f(x) - f(x') - \langle x - x', \nabla f(x')\rangle. \tag{3.29}$$

Also see Figure 3.6. Here are some well known examples.


Example 3.5 (Square Euclidean Norm) Set $f(x) = \frac{1}{2} \|x\|^2$. Clearly, $\nabla f(x) = x$ and therefore

$$\Delta_f(x, x') = \frac{1}{2} \|x\|^2 - \frac{1}{2} \|x'\|^2 - \langle x - x', x'\rangle = \frac{1}{2} \|x - x'\|^2.$$


Fig. 3.6. $f(x)$ is the value of the function at $x$, while $f(x') + \langle x - x', \nabla f(x')\rangle$ denotes the first order Taylor expansion of $f$ around $x'$, evaluated at $x$. The difference between these two quantities is the Bregman divergence, as illustrated.


Example 3.6 (Relative Entropy) Let $f$ be the un-normalized entropy

$$f(x) = \sum_i (x_i \log x_i - x_i). \tag{3.30}$$

One can calculate $\nabla f(x) = \log x$, where $\log x$ is the component wise logarithm of the entries of $x$, and write the Bregman divergence

$$\Delta_f(x, x') = \sum_i x_i \log x_i - \sum_i x_i - \sum_i x'_i \log x'_i + \sum_i x'_i - \langle x - x', \log x'\rangle = \sum_i \left(x_i \log\left(\frac{x_i}{x'_i}\right) + x'_i - x_i\right).$$
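
Definition (3.29) translates directly into code, which makes it easy to verify Examples 3.5 and 3.6 on concrete vectors. The sketch below uses hypothetical helper names and represents vectors as plain Python lists; the un-normalized entropy requires strictly positive entries.

```python
import math


def bregman(f, grad_f, x, y):
    """Delta_f(x, y) = f(x) - f(y) - <x - y, grad f(y)>, cf. (3.29)."""
    g = grad_f(y)
    return f(x) - f(y) - sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))


def half_sq_norm(x):
    """f(x) = 0.5 * ||x||^2, whose gradient is x itself."""
    return 0.5 * sum(xi * xi for xi in x)


def unnormalized_entropy(x):
    """f(x) = sum_i (x_i log x_i - x_i), whose gradient is log x."""
    return sum(xi * math.log(xi) - xi for xi in x)


def log_vector(x):
    """Component wise logarithm, the gradient of unnormalized_entropy."""
    return [math.log(xi) for xi in x]
```

For instance, with `x = [1.0, 2.0]` and `y = [3.0, 0.5]`, `bregman(half_sq_norm, list, x, y)` evaluates to $\frac{1}{2}\|x - y\|^2 = 3.125$, and swapping the arguments of the entropy based divergence generally changes its value.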


Example 3.7 (p-norm) Let $f$ be the square p-norm

$$f(x) = \frac{1}{2} \|x\|_p^2 = \frac{1}{2} \left(\sum_i |x_i|^p\right)^{2/p}. \tag{3.31}$$

We say that the q-norm is dual to the p-norm whenever $\frac{1}{p} + \frac{1}{q} = 1$. One can verify (Problem 3.12) that the i-th component of the gradient $\nabla f(x)$ is

$$\nabla_{x_i} f(x) = \frac{\mathrm{sign}(x_i)\, |x_i|^{p-1}}{\|x\|_p^{p-2}}. \tag{3.32}$$

The corresponding Bregman divergence is

$$\Delta_f(x, x') = \frac{1}{2} \|x\|_p^2 - \frac{1}{2} \|x'\|_p^2 - \sum_i (x_i - x'_i) \frac{\mathrm{sign}(x'_i)\, |x'_i|^{p-1}}{\|x'\|_p^{p-2}}.$$


The following properties of the Bregman divergence immediately follow: 


• $\Delta_f(x, x')$ is convex in $x$.
• $\Delta_f(x, x') \geq 0$.
• $\Delta_f$ may not be symmetric, that is, in general $\Delta_f(x, x') \neq \Delta_f(x', x)$.
• $\nabla_x \Delta_f(x, x') = \nabla f(x) - \nabla f(x')$.


The next lemma establishes another important property. 


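Lemma 3.11 The Bregman divergence (3.29) defined by a differentiable convex function $f$ satisfies

$$\Delta_f(x, y) + \Delta_f(y, z) - \Delta_f(x, z) = \langle \nabla f(z) - \nabla f(y), x - y\rangle. \tag{3.33}$$

Proof

$$\Delta_f(x, y) + \Delta_f(y, z) = f(x) - f(y) - \langle x - y, \nabla f(y)\rangle + f(y) - f(z) - \langle y - z, \nabla f(z)\rangle$$
$$= f(x) - f(z) - \langle x - y, \nabla f(y)\rangle - \langle y - z, \nabla f(z)\rangle$$
$$= \Delta_f(x, z) + \langle \nabla f(z) - \nabla f(y), x - y\rangle.$$

The last step substitutes $f(x) - f(z) = \Delta_f(x, z) + \langle x - z, \nabla f(z)\rangle$ and collects the terms involving $\nabla f(z)$. ∎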


3.2 Unconstrained Smooth Convex Minimization 


In this section we will describe various methods to minimize a smooth convex 
objective function. 


3.2.1 Minimizing a One-Dimensional Convex Function 


As a warm up let us consider the problem of minimizing a smooth one dimensional convex function $J : \mathbb{R} \to \mathbb{R}$ in the interval $[L, U]$. This seemingly simple problem has many applications. As we will see later, many optimization methods find a direction of descent and minimize the objective function along this direction¹; this subroutine is called a line search. Algorithm 3.1 depicts a simple line search routine based on interval bisection.


Algorithm 3.1 Interval Bisection

    Input: L, U, precision ε
    Set t = 0, a_0 = L and b_0 = U
    while (b_t − a_t) · J'(U) > ε do
        if J'((a_t + b_t)/2) > 0 then
            a_{t+1} = a_t and b_{t+1} = (a_t + b_t)/2
        else
            a_{t+1} = (a_t + b_t)/2 and b_{t+1} = b_t
        end if
        t = t + 1
    end while
    Return: (a_t + b_t)/2

¹ If the objective function is convex, then the one dimensional function obtained by restricting it along the search direction is also convex (Exercise 3.10).

Before we show that Algorithm 3.1 converges, let us first derive an important property of convex functions of one variable. For a differentiable one-dimensional convex function $J$ (3.7) reduces to

$$J(w) \geq J(w') + (w - w') \cdot J'(w'), \tag{3.34}$$

where $J'(w)$ denotes the gradient of $J$. Exchanging the role of $w$ and $w'$ in (3.34), we can write

$$J(w') \geq J(w) + (w' - w) \cdot J'(w). \tag{3.35}$$

Adding the above two equations yields

$$(w - w') \cdot (J'(w) - J'(w')) \geq 0. \tag{3.36}$$

If $w \geq w'$, then this implies that $J'(w) \geq J'(w')$. In other words, the gradient of a one dimensional convex function is monotonically non-decreasing.
Recall that minimizing a convex function is equivalent to finding $w^*$ such that $J'(w^*) = 0$. Furthermore, it is easy to see that the interval bisection maintains the invariant $J'(a_t) < 0$ and $J'(b_t) > 0$. This along with the monotonicity of the gradient suffices to ensure that $w^* \in (a_t, b_t)$. Setting $w = w^*$ in (3.34), and using the monotonicity of the gradient allows us to write for any $w' \in (a_t, b_t)$

$$J(w') - J(w^*) \leq (w' - w^*) \cdot J'(w') \leq (b_t - a_t) \cdot J'(U). \tag{3.37}$$

Since we halve the interval $(a_t, b_t)$ at every iteration, it follows that $(b_t - a_t) = (U - L)/2^t$. Therefore

$$J(w') - J(w^*) \leq \frac{(U - L) \cdot J'(U)}{2^t}, \tag{3.38}$$

for all $w' \in (a_t, b_t)$. In other words, to find an $\epsilon$-accurate solution, that is, $J(w') - J(w^*) \leq \epsilon$, we only need $\log(U - L) + \log J'(U) + \log(1/\epsilon) < t$ iterations, where the logarithms are taken to base 2. An algorithm which converges to an $\epsilon$ accurate solution in $O(\log(1/\epsilon))$ iterations is said to be linearly convergent.
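
The following Python function is a minimal sketch of Algorithm 3.1. It is not taken from the text: the name `interval_bisection` and the argument `dJ` (a callable returning the derivative $J'$) are our own choices, and the sketch assumes $J'(L) < 0 < J'(U)$ so that the minimizer lies inside the initial interval.

```python
def interval_bisection(dJ, lower, upper, eps):
    """Sketch of Algorithm 3.1 for a convex J with derivative dJ.

    Returns the midpoint of the final interval and the number of halvings.
    """
    a, b, t = lower, upper, 0
    slope_at_upper = dJ(upper)           # J'(U), the constant in the stopping rule
    while (b - a) * slope_at_upper > eps:
        mid = 0.5 * (a + b)
        if dJ(mid) > 0:
            b = mid                      # minimizer lies to the left of mid
        else:
            a = mid                      # minimizer lies at or to the right of mid
        t += 1
    return 0.5 * (a + b), t
```

For $J(w) = (w - 1)^2$ on $[-2, 3]$ we have $(U - L) \cdot J'(U) = 20$, so a precision of $\epsilon = 10^{-6}$ requires $20 / 2^t \leq 10^{-6}$, that is, $t = 25$ halvings, in line with (3.38).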

For multi-dimensional objective functions, one cannot rely on the mono- 
tonicity property of the gradient. Therefore, one needs more sophisticated 
optimization algorithms, some of which we now describe. 


3.2.2 Coordinate Descent 


Coordinate descent is conceptually the simplest algorithm for minimizing a
multidimensional smooth convex function $J : \mathbb{R}^n \to \mathbb{R}$.
At every iteration select a coordinate, say $i$, and update

$$w_{t+1} = w_t - \eta_t e_i. \tag{3.39}$$

Here $e_i$ denotes the $i$-th basis vector, that is, a vector with one at the
$i$-th coordinate and zeros everywhere else, while $\eta_t \in \mathbb{R}$ is
a non-negative scalar step size. One could, for instance, minimize the one
dimensional convex function $J(w_t - \eta e_i)$ to obtain the stepsize
$\eta_t$. The coordinates can either be selected cyclically, that is,
$1, 2, \ldots, n, 1, 2, \ldots$ or greedily, that is, the coordinate which
yields the maximum reduction in function value.

Even though coordinate descent can be shown to converge if $J$ has a
Lipschitz continuous gradient [LT92], in practice it can be quite slow. However,
if a high precision solution is not required, as is the case in some machine 
learning applications, coordinate descent is often used because a) the cost 
per iteration is very low and b) the speed of convergence may be acceptable 
especially if the variables are loosely coupled. 
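
As a hedged illustration of the cyclic variant, the sketch below applies coordinate descent to a quadratic objective $J(w) = \frac{1}{2} w^\top A w - b^\top w$ with a symmetric positive definite matrix $A$. The quadratic is an assumption chosen so that the one dimensional minimization has a closed form; the function name `coordinate_descent` and the data are hypothetical. Note that in this sketch the step is allowed to take either sign.

```python
def coordinate_descent(A, b, w0, sweeps=100):
    """Cyclic coordinate descent for J(w) = 0.5 * w^T A w - b^T w.

    A is a symmetric positive definite matrix given as a list of rows.
    Along coordinate i the exact minimizer uses the step eta = g_i / A[i][i],
    where g_i is the i-th partial derivative of J at the current w.
    """
    n = len(b)
    w = list(w0)
    for _ in range(sweeps):
        for i in range(n):  # cyclic selection 1, 2, ..., n
            g_i = sum(A[i][j] * w[j] for j in range(n)) - b[i]
            eta = g_i / A[i][i]
            w[i] -= eta  # w <- w - eta * e_i
    return w


A = [[2.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
w = coordinate_descent(A, b, [0.0, 0.0])
```

Each inner step touches a single row of `A`, which is the low per-iteration cost mentioned above. For this example the iterates approach the solution of $Aw = b$, namely $(0.2, 0.6)$.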


3.2.3 Gradient Descent

Gradient descent (also widely known as steepest descent) is an optimization
technique for minimizing multidimensional smooth convex objective functions
of the form $J : \mathbb{R}^n \to \mathbb{R}$. The basic idea is as follows:
Given a location $w_t$ at iteration $t$, compute the gradient
$\nabla J(w_t)$, and update

$$w_{t+1} = w_t - \eta_t \nabla J(w_t), \tag{3.40}$$

where $\eta_t$ is a scalar stepsize. See Algorithm 3.2 for details. Different
variants of gradient descent depend on how $\eta_t$ is chosen:


Exact Line Search: Since $J(w_t - \eta \nabla J(w_t))$ is a one dimensional
convex function in $\eta$, one can use Algorithm 3.1 to compute:

$$\eta_t = \operatorname*{argmin}_{\eta} J(w_t - \eta \nabla J(w_t)). \tag{3.41}$$

Instead of the simple bisecting line search, more sophisticated line searches
such as the Moré-Thuente line search or the golden section search can also
be used to speed up convergence (see [NW99] Chapter 3 for an extensive
discussion).


Inexact Line Search: Instead of minimizing $J(w_t - \eta \nabla J(w_t))$ we
could simply look for a stepsize which results in sufficient decrease in the
objective function value. One popular set of sufficient decrease conditions
is the Wolfe conditions

$$J(w_{t+1}) \leq J(w_t) + c_1 \eta_t \langle \nabla J(w_t), w_{t+1} - w_t \rangle \quad \text{(sufficient decrease)} \tag{3.42}$$

$$\langle \nabla J(w_{t+1}), w_{t+1} - w_t \rangle \geq c_2 \langle \nabla J(w_t), w_{t+1} - w_t \rangle \quad \text{(curvature)} \tag{3.43}$$

with $0 < c_1 < c_2 < 1$ (see Figure 3.7). The Wolfe conditions are also called
the Armijo-Goldstein conditions. If only sufficient decrease (3.42) alone is
enforced, then it is called the Armijo rule.


Fig. 3.7. The sufficient decrease condition (left) places an upper bound on the
acceptable stepsizes while the curvature condition (right) places a lower bound on
the acceptable stepsizes.

Algorithm 3.2 Gradient Descent

1: Input: Initial point $w_0$, gradient norm tolerance $\epsilon$
2: Set $t = 0$
3: while $\|\nabla J(w_t)\| > \epsilon$ do
4:    $w_{t+1} = w_t - \eta_t \nabla J(w_t)$
5:    $t = t + 1$
6: end while
7: Return: $w_t$


Decaying Stepsize: Instead of performing a line search at every iteration,
one can use a stepsize which decays according to a fixed schedule, for
example, $\eta_t = 1/\sqrt{t}$. In Section 3.2.4 we will discuss the decay
schedule and convergence rates of a generalized version of gradient descent.


Fixed Stepsize: Suppose $J$ has a Lipschitz continuous gradient with modulus
$L$. Using (3.22) and the gradient descent update
$w_{t+1} = w_t - \eta_t \nabla J(w_t)$ one can write

$$J(w_{t+1}) \leq J(w_t) + \langle \nabla J(w_t), w_{t+1} - w_t \rangle + \frac{L}{2} \|w_{t+1} - w_t\|^2 \tag{3.44}$$

$$= J(w_t) - \eta_t \|\nabla J(w_t)\|^2 + \frac{L \eta_t^2}{2} \|\nabla J(w_t)\|^2. \tag{3.45}$$

Minimizing (3.45) as a function of $\eta_t$ shows that the upper bound on
$J(w_{t+1})$ is minimized when we set $\eta_t = \frac{1}{L}$, which is the
fixed stepsize rule.


Theorem 3.12 Suppose $J$ has a Lipschitz continuous gradient with modulus
$L$. Then Algorithm 3.2 with a fixed stepsize $\eta_t = \frac{1}{L}$ will
return a solution $w_t$ with $\|\nabla J(w_t)\| \leq \epsilon$ in at most
$O(1/\epsilon^2)$ iterations.


Proof Plugging in $\eta_t = \frac{1}{L}$ and rearranging (3.45) obtains

$$\frac{1}{2L} \|\nabla J(w_t)\|^2 \leq J(w_t) - J(w_{t+1}). \tag{3.46}$$

Summing this inequality over $t = 0, \ldots, T$

$$\frac{1}{2L} \sum_{t=0}^{T} \|\nabla J(w_t)\|^2 \leq J(w_0) - J(w_{T+1}) \leq J(w_0) - J(w^*),$$

which clearly shows that $\|\nabla J(w_t)\| \to 0$ as $t \to \infty$.
Furthermore, we can write the following simple inequality:

$$\|\nabla J(w_T)\| \leq \sqrt{\frac{2L (J(w_0) - J(w^*))}{T + 1}}.$$

Solving for

$$\sqrt{\frac{2L (J(w_0) - J(w^*))}{T + 1}} = \epsilon$$

shows that $T$ is $O(1/\epsilon^2)$ as claimed. □


If, in addition to having a Lipschitz continuous gradient, $J$ is
$\sigma$-strongly convex, then more can be said. First, one can translate
convergence in $\|\nabla J(w_t)\|$ to convergence in function values. Towards
this end, use (3.17) to write

$$J(w_t) \leq J(w^*) + \frac{1}{2\sigma} \|\nabla J(w_t)\|^2.$$

Therefore, it follows that whenever $\|\nabla J(w_t)\| \leq \epsilon$ we have
$J(w_t) - J(w^*) \leq \epsilon^2 / 2\sigma$. Furthermore, we can strengthen
the rates of convergence.


Theorem 3.13 Assume everything as in Theorem 3.12. Moreover assume
that $J$ is $\sigma$-strongly convex, and let $c := 1 - \frac{\sigma}{L}$.
Then $J(w_t) - J(w^*) \leq \epsilon$ after at most

$$\frac{\log((J(w_0) - J(w^*))/\epsilon)}{\log(1/c)} \tag{3.47}$$

iterations.
Proof Combining (3.46) with
$\|\nabla J(w_t)\|^2 \geq 2\sigma (J(w_t) - J(w^*))$, and using the definition
of $c$ one can write

$$c (J(w_t) - J(w^*)) \geq J(w_{t+1}) - J(w^*).$$

Applying the above inequality recursively

$$c^T (J(w_0) - J(w^*)) \geq J(w_T) - J(w^*).$$

Solving for

$$\epsilon = c^T (J(w_0) - J(w^*))$$

and rearranging yields (3.47). □


When applied to practical problems which are not strongly convex, gradient
descent yields a low accuracy solution within a few iterations. However, as
the iterations progress the method “stalls” and no further increase in
accuracy is obtained because of the $O(1/\epsilon^2)$ rates of convergence. On
the other hand, if the function is strongly convex, then gradient descent
converges linearly, that is, in $O(\log(1/\epsilon))$ iterations. However, the
number of iterations depends inversely on $\log(1/c)$. If we approximate
$\log(1/c) = -\log(1 - \sigma/L) \approx \sigma/L$, then it shows that
convergence depends on the ratio $L/\sigma$. This ratio is called the
condition number of a problem. If the problem is well conditioned, i.e.,
$\sigma \approx L$ then gradient descent converges extremely fast. In
contrast, if $\sigma \ll L$ then gradient descent requires many iterations.
This is best illustrated with an example: Consider the quadratic objective
function

$$J(w) = \frac{1}{2} w^\top A w - b^\top w, \tag{3.48}$$

where $A \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix,
and $b \in \mathbb{R}^n$ is an arbitrary vector.

Recall that a twice differentiable function is $\sigma$-strongly convex and
has a Lipschitz continuous gradient with modulus $L$ if and only if its
Hessian satisfies $L I \succeq \nabla^2 J(w) \succeq \sigma I$ (see (3.14) and
(3.21)). In the case of the quadratic function (3.48) $\nabla^2 J(w) = A$ and
hence $\sigma = \lambda_{\min}$ and $L = \lambda_{\max}$, where
$\lambda_{\min}$ (respectively $\lambda_{\max}$) denotes the minimum
(respectively maximum) eigenvalue of $A$. One can thus change the condition
number of the problem by varying the eigen-spectrum of the matrix $A$. For
instance, if we set $A$ to the $n \times n$ identity matrix, then
$\lambda_{\max} = \lambda_{\min} = 1$ and hence the problem is well
conditioned. In this case, gradient descent converges very quickly to the
optimal solution. We illustrate this behavior on a two dimensional quadratic
function in Figure 3.8 (right).

On the other hand, if we choose $A$ such that
$\lambda_{\max} \gg \lambda_{\min}$ then the problem (3.48) becomes
ill-conditioned. In this case gradient descent exhibits zigzagging and slow
convergence as can be seen in Figure 3.8 (left). Because of these
shortcomings, gradient descent is not widely used in practice. A number of
the algorithms we describe below can be understood as explicitly or
implicitly changing the condition number of the problem to accelerate
convergence.
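
The effect of the condition number can be checked numerically. The sketch below runs gradient descent with the fixed stepsize $1/L$ of Theorem 3.12 on two diagonal quadratics; the helper names, the starting point, and the eigenvalues are assumptions chosen for illustration. Setting $b = 0$ and taking $A$ diagonal keeps the example short, so that $L$ is simply the largest diagonal entry.

```python
import math


def gradient_descent(grad, w0, step, tol=1e-6, max_iter=100000):
    """Gradient descent with a fixed stepsize; returns (w, iterations used)."""
    w = list(w0)
    for t in range(max_iter):
        g = grad(w)
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            return w, t
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w, max_iter


def diagonal_quadratic_grad(eigenvalues):
    """Gradient of J(w) = 0.5 * sum_i lambda_i * w_i^2 (A diagonal, b = 0)."""
    return lambda w: [lam * wi for lam, wi in zip(eigenvalues, w)]


results = {}
for name, eigs in [("well", [1.0, 1.0]), ("ill", [1.0, 100.0])]:
    L = max(eigs)  # Lipschitz modulus of the gradient
    _, iters = gradient_descent(diagonal_quadratic_grad(eigs), [1.0, 1.0], 1.0 / L)
    results[name] = iters
```

With condition number 1 a single step reaches the optimum, whereas with condition number 100 the slow coordinate shrinks by only a factor of $1 - \sigma/L = 0.99$ per iteration, so well over a thousand iterations are needed for the same tolerance.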


3.2.4 Mirror Descent 


One way to motivate gradient descent is to use the following quadratic
approximation of the objective function

$$Q_t(w) := J(w_t) + \langle \nabla J(w_t), w - w_t \rangle + \frac{1}{2} (w - w_t)^\top (w - w_t), \tag{3.49}$$

where, as in the previous section, $\nabla J(\cdot)$ denotes the gradient of
$J$. Minimizing this quadratic model at every iteration entails taking
gradients with respect to $w$ and setting it to zero, which gives

$$w - w_t := -\nabla J(w_t). \tag{3.50}$$

Performing a line search along the direction $-\nabla J(w_t)$ recovers the
familiar gradient descent update

$$w_{t+1} = w_t - \eta_t \nabla J(w_t). \tag{3.51}$$

Fig. 3.8. Convergence of gradient descent with exact line search on two quadratic
problems (3.48). The problem on the left is ill-conditioned, whereas the problem
on the right is well-conditioned. We plot the contours of the objective function,
and the steps taken by gradient descent. As can be seen gradient descent converges
fast on the well conditioned problem, while it zigzags and takes many iterations to
converge on the ill-conditioned problem.


The closely related mirror descent method replaces the quadratic penalty
in (3.49) by a Bregman divergence defined by some convex function $f$ to
yield

$$Q_t(w) := J(w_t) + \langle \nabla J(w_t), w - w_t \rangle + \Delta_f(w, w_t). \tag{3.52}$$

Computing the gradient, setting it to zero, and using
$\nabla_w \Delta_f(w, w_t) = \nabla f(w) - \nabla f(w_t)$, the minimizer of
the above model can be written as

$$\nabla f(w) - \nabla f(w_t) = -\nabla J(w_t). \tag{3.53}$$

As before, by using a stepsize $\eta_t$, the resulting updates can be written
as

$$w_{t+1} = \nabla f^{-1}(\nabla f(w_t) - \eta_t \nabla J(w_t)). \tag{3.54}$$

It is easy to verify that choosing $f(\cdot) = \frac{1}{2} \|\cdot\|^2$
recovers the usual gradient descent updates. On the other hand if we choose
$f$ to be the un-normalized entropy (3.30) then $\nabla f(\cdot) = \log(\cdot)$
and therefore (3.54) specializes to

$$w_{t+1} = \exp(\log(w_t) - \eta_t \nabla J(w_t)) = w_t \exp(-\eta_t \nabla J(w_t)), \tag{3.55}$$

which is sometimes called the Exponentiated Gradient (EG) update.
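
The following sketch shows one way the generic update (3.54) and its entropy specialization (3.55) could be coded. It assumes that $f$ is separable, so that $\nabla f$ and its inverse act coordinate-wise; the function names, the stepsize 0.5, and the toy objective $J(w) = \frac{1}{2}\|w - a\|^2$ are hypothetical choices for this illustration.

```python
import math


def mirror_descent_step(w, grad, eta, grad_f, grad_f_inv):
    """One update w_next = grad_f_inv(grad_f(w) - eta * grad), applied
    coordinate-wise (this assumes f is separable across coordinates)."""
    return [grad_f_inv(grad_f(wi) - eta * gi) for wi, gi in zip(w, grad)]


def exponentiated_gradient_step(w, grad, eta):
    """Closed form of the same update for the un-normalized entropy."""
    return [wi * math.exp(-eta * gi) for wi, gi in zip(w, grad)]


# Hypothetical objective: J(w) = 0.5 * ||w - a||^2 with a > 0, started at w > 0.
a = [0.5, 2.0]
w = [1.0, 1.0]
for _ in range(200):
    grad = [wi - ai for wi, ai in zip(w, a)]
    w_mirror = mirror_descent_step(w, grad, 0.5, math.log, math.exp)
    w_eg = exponentiated_gradient_step(w, grad, 0.5)
    # Both forms of the update agree up to rounding error.
    assert all(abs(x - y) < 1e-12 for x, y in zip(w_mirror, w_eg))
    w = w_eg
```

Because the update is multiplicative, the iterates remain strictly positive without any explicit projection, which is one reason the entropy is a natural choice when the parameters are constrained to be positive.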




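Theorem 3.14 Let $J$ be a convex function and $J(w^*)$ denote its minimum
value. The mirror descent updates (3.54) with a $\sigma$-strongly convex
function $f$ satisfy

$$\frac{\Delta_f(w^*, w_1) + \frac{1}{2\sigma} \sum_t \eta_t^2 \|\nabla J(w_t)\|^2}{\sum_t \eta_t} \geq \min_t J(w_t) - J(w^*).$$

Proof Using the convexity of $J$ (see (3.7)) and (3.54) we can write

$$J(w^*) \geq J(w_t) + \langle w^* - w_t, \nabla J(w_t) \rangle = J(w_t) - \frac{1}{\eta_t} \langle w^* - w_t, \nabla f(w_{t+1}) - \nabla f(w_t) \rangle.$$

Now applying Lemma 3.11 and rearranging

$$\Delta_f(w^*, w_t) - \Delta_f(w^*, w_{t+1}) + \Delta_f(w_t, w_{t+1}) \geq \eta_t (J(w_t) - J(w^*)).$$

Summing over $t = 1, \ldots, T$

$$\Delta_f(w^*, w_1) - \Delta_f(w^*, w_{T+1}) + \sum_t \Delta_f(w_t, w_{t+1}) \geq \sum_t \eta_t (J(w_t) - J(w^*)).$$

Noting that $\Delta_f(w^*, w_{T+1}) \geq 0$,
$J(w_t) - J(w^*) \geq \min_t J(w_t) - J(w^*)$, and rearranging it follows that

$$\frac{\Delta_f(w^*, w_1) + \sum_t \Delta_f(w_t, w_{t+1})}{\sum_t \eta_t} \geq \min_t J(w_t) - J(w^*). \tag{3.56}$$

Using (3.17) and (3.54)

$$\Delta_f(w_t, w_{t+1}) \leq \frac{1}{2\sigma} \|\nabla f(w_t) - \nabla f(w_{t+1})\|^2 = \frac{1}{2\sigma} \eta_t^2 \|\nabla J(w_t)\|^2. \tag{3.57}$$

The proof is completed by plugging (3.57) into (3.56). □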


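Corollary 3.15 If $J$ has a Lipschitz continuous gradient with modulus $L$,
and the stepsizes $\eta_t$ are chosen as

$$\eta_t = \frac{\sqrt{2\sigma \Delta_f(w^*, w_1)}}{L} \frac{1}{\sqrt{t}} \quad \text{then} \tag{3.58}$$

$$\min_{1 \leq t \leq T} J(w_t) - J(w^*) \leq L \sqrt{\frac{2 \Delta_f(w^*, w_1)}{\sigma}} \frac{1}{\sqrt{T}}.$$

Proof Since $\nabla J$ is Lipschitz continuous

$$\min_{1 \leq t \leq T} J(w_t) - J(w^*) \leq \frac{\Delta_f(w^*, w_1) + \frac{1}{2\sigma} \sum_t \eta_t^2 L^2}{\sum_t \eta_t}.$$

Plugging in (3.58) and using Problem 3.15

$$\min_{1 \leq t \leq T} J(w_t) - J(w^*) \leq L \sqrt{\frac{\Delta_f(w^*, w_1)}{2\sigma}} \, \frac{1 + \sum_t \frac{1}{t}}{\sum_t \frac{1}{\sqrt{t}}} \leq L \sqrt{\frac{2 \Delta_f(w^*, w_1)}{\sigma}} \frac{1}{\sqrt{T}}.$$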

3.2.5 Conjugate Gradient 


Let us revisit the problem of minimizing the quadratic objective function
(3.48). Since $\nabla J(w) = Aw - b$, at the optimum $\nabla J(w) = 0$ (see
Lemma 3.6) and hence

$$Aw = b. \tag{3.59}$$


In fact, the Conjugate Gradient (CG) algorithm was first developed as a 
method to solve the above linear system. 

As we already saw, updating w along the negative gradient direction may 
lead to zigzagging. Therefore CG uses the so-called conjugate directions. 


Definition 3.16 (Conjugate Directions) Non zero vectors $p_t$ and $p_{t'}$ are
said to be conjugate with respect to a symmetric positive definite matrix $A$
if $p_{t'}^\top A p_t = 0$ whenever $t \neq t'$.


Conjugate directions $\{p_0, \ldots, p_{n-1}\}$ are linearly independent and
form a basis. To see this, suppose the $p_t$'s are not linearly independent.
Then there exist coefficients $\sigma_t$, not all zero, such that
$\sum_t \sigma_t p_t = 0$. The $p_t$'s are conjugate directions, therefore
$p_{t'}^\top A (\sum_t \sigma_t p_t) = \sum_t \sigma_t p_{t'}^\top A p_t = \sigma_{t'} p_{t'}^\top A p_{t'} = 0$
for all $t'$. Since $A$ is positive definite this implies that
$\sigma_{t'} = 0$ for all $t'$, a contradiction.
As it turns out, the conjugate directions can be generated iteratively as
follows: Starting with any $w_0 \in \mathbb{R}^n$ define
$p_0 = -g_0 = b - A w_0$, and set

$$\alpha_t = -\frac{g_t^\top p_t}{p_t^\top A p_t} \tag{3.60a}$$

$$w_{t+1} = w_t + \alpha_t p_t \tag{3.60b}$$

$$g_{t+1} = A w_{t+1} - b \tag{3.60c}$$

$$\beta_{t+1} = \frac{g_{t+1}^\top A p_t}{p_t^\top A p_t} \tag{3.60d}$$

$$p_{t+1} = -g_{t+1} + \beta_{t+1} p_t \tag{3.60e}$$

The following theorem asserts that the $p_t$ generated by the above procedure
are indeed conjugate directions.


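Theorem 3.17 Suppose the $t$-th iterate generated by the conjugate gradient
method (3.60) is not the solution of (3.59), then the following properties
hold:

$$\operatorname{span}\{g_0, g_1, \ldots, g_t\} = \operatorname{span}\{g_0, A g_0, \ldots, A^t g_0\}. \tag{3.61}$$

$$\operatorname{span}\{p_0, p_1, \ldots, p_t\} = \operatorname{span}\{g_0, A g_0, \ldots, A^t g_0\}. \tag{3.62}$$

$$p_j^\top g_t = 0 \text{ for all } j < t \tag{3.63}$$

$$p_j^\top A p_t = 0 \text{ for all } j < t. \tag{3.64}$$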


Proof The proof is by induction. The induction hypothesis holds trivially
at $t = 0$. Assuming that (3.61) to (3.64) hold for some $t$, we prove that
they continue to hold for $t + 1$.

Step 1: We first prove that (3.63) holds. Using (3.60c), (3.60b) and (3.60a)

$$p_j^\top g_{t+1} = p_j^\top (A w_{t+1} - b) = p_j^\top (A w_t + \alpha_t A p_t - b)$$

$$= p_j^\top \left( A w_t - \frac{g_t^\top p_t}{p_t^\top A p_t} A p_t - b \right) = p_j^\top g_t - \frac{p_j^\top A p_t}{p_t^\top A p_t} g_t^\top p_t.$$

For $j = t$, both terms cancel out, while for $j < t$ both terms vanish due to
the induction hypothesis.


Step 2: Next we prove that (3.61) holds. Using (3.60c) and (3.60b)

$$g_{t+1} = A w_{t+1} - b = A w_t + \alpha_t A p_t - b = g_t + \alpha_t A p_t.$$

By our induction hypothesis,
$g_t \in \operatorname{span}\{g_0, A g_0, \ldots, A^t g_0\}$, while
$A p_t \in \operatorname{span}\{A g_0, A^2 g_0, \ldots, A^{t+1} g_0\}$.
Combining the two we conclude that
$g_{t+1} \in \operatorname{span}\{g_0, A g_0, \ldots, A^{t+1} g_0\}$. On the
other hand, we already showed that $g_{t+1}$ is orthogonal to
$\{p_0, p_1, \ldots, p_t\}$. Therefore,
$g_{t+1} \notin \operatorname{span}\{p_0, p_1, \ldots, p_t\}$. Thus our
induction assumption implies that
$g_{t+1} \notin \operatorname{span}\{g_0, A g_0, \ldots, A^t g_0\}$. This
allows us to conclude that
$\operatorname{span}\{g_0, g_1, \ldots, g_{t+1}\} = \operatorname{span}\{g_0, A g_0, \ldots, A^{t+1} g_0\}$.

Step 3: We now prove (3.64) holds. Using (3.60e)

$$p_{t+1}^\top A p_j = -g_{t+1}^\top A p_j + \beta_{t+1} p_t^\top A p_j.$$

By the definition of $\beta_{t+1}$ (3.60d) the above expression vanishes for
$j = t$. For $j < t$, the first term is zero because
$A p_j \in \operatorname{span}\{p_0, p_1, \ldots, p_{j+1}\}$, a subspace
orthogonal to $g_{t+1}$ as already shown in Step 1. The induction hypothesis
guarantees that the second term is zero.

Step 4: Clearly, (3.61) and (3.60e) imply (3.62). This concludes the proof. □


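A practical implementation of (3.60) requires two more observations:
First, using (3.60e) and (3.63)

$$-g_t^\top p_t = g_t^\top g_t - \beta_t g_t^\top p_{t-1} = g_t^\top g_t.$$

Therefore (3.60a) simplifies to

$$\alpha_t = \frac{g_t^\top g_t}{p_t^\top A p_t}. \tag{3.65}$$

Second, using (3.60c) and (3.60b)

$$g_{t+1} - g_t = A (w_{t+1} - w_t) = \alpha_t A p_t.$$

But $g_t \in \operatorname{span}\{p_0, \ldots, p_t\}$, a subspace orthogonal to
$g_{t+1}$ by (3.63). Therefore
$g_{t+1}^\top A p_t = \frac{1}{\alpha_t} (g_{t+1}^\top g_{t+1})$. Substituting
this back into (3.60d) and using (3.65) yields

$$\beta_{t+1} = \frac{g_{t+1}^\top g_{t+1}}{g_t^\top g_t}. \tag{3.66}$$
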
We summarize the CG algorithm in Algorithm 3.3. Unlike gradient descent 
whose convergence rates for minimizing the quadratic objective function 
(3.48) depend upon the condition number of A, as the following theorem 
shows, the CG iterates converge in at most n steps. 
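
To make the recurrences concrete, here is a minimal pure-Python sketch that uses the simplified coefficients (3.65) and (3.66). The function name `conjugate_gradient`, the tolerance, and the small test system are assumptions for illustration; a practical implementation would use a numerical linear algebra library instead of nested lists.

```python
import math


def conjugate_gradient(A, b, w0, tol=1e-10):
    """Solve A w = b for symmetric positive definite A (list of rows).

    Returns the final iterate and the number of iterations used.
    """
    def matvec(M, v):
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    w = list(w0)
    g = [r - bi for r, bi in zip(matvec(A, w), b)]  # g_0 = A w_0 - b
    p = [-gi for gi in g]                           # p_0 = -g_0
    t = 0
    while math.sqrt(dot(g, g)) > tol:
        Ap = matvec(A, p)
        alpha = dot(g, g) / dot(p, Ap)                        # (3.65)
        w = [wi + alpha * pi for wi, pi in zip(w, p)]         # (3.60b)
        g_next = [gi + alpha * api for gi, api in zip(g, Ap)]  # g + alpha A p
        beta = dot(g_next, g_next) / dot(g, g)                # (3.66)
        p = [-gi + beta * pi for gi, pi in zip(g_next, p)]    # (3.60e)
        g = g_next
        t += 1
    return w, t


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
w, iters = conjugate_gradient(A, b, [0.0, 0.0, 0.0])
```

Only one matrix-vector product is needed per iteration, and for this 3 x 3 system the loop stops after at most three iterations, up to rounding error.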


Theorem 3.18 The CG iterates (3.60) converge to the minimizer of (3.48)
after at most $n$ steps.

Proof Let $w$ denote the minimizer of (3.48). Since the $p_t$'s form a basis

$$w - w_0 = \sigma_0 p_0 + \ldots + \sigma_{n-1} p_{n-1},$$

for some scalars $\sigma_t$. Our proof strategy will be to show that the
coefficients $\sigma_t$ coincide with $\alpha_t$ defined in (3.60a). Towards
this end premultiply with $p_t^\top A$ and use conjugacy to obtain

$$\sigma_t = \frac{p_t^\top A (w - w_0)}{p_t^\top A p_t}. \tag{3.67}$$

On the other hand, following the iterative process (3.60b) from $w_0$ until
$w_t$ yields

$$w_t - w_0 = \alpha_0 p_0 + \ldots + \alpha_{t-1} p_{t-1}.$$

Again premultiplying with $p_t^\top A$ and using conjugacy

$$p_t^\top A (w_t - w_0) = 0. \tag{3.68}$$

Substituting (3.68) into (3.67) produces

$$\sigma_t = \frac{p_t^\top A (w - w_t)}{p_t^\top A p_t} = -\frac{g_t^\top p_t}{p_t^\top A p_t},$$

thus showing that $\sigma_t = \alpha_t$. □

Algorithm 3.3 Conjugate Gradient

1: Input: Initial point $w_0$, residual norm tolerance $\epsilon$
2: Set $t = 0$, $g_0 = A w_0 - b$, and $p_0 = -g_0$
3: while $\|A w_t - b\| > \epsilon$ do
4:    $\alpha_t = \frac{g_t^\top g_t}{p_t^\top A p_t}$
5:    $w_{t+1} = w_t + \alpha_t p_t$
6:    $g_{t+1} = g_t + \alpha_t A p_t$
7:    $\beta_{t+1} = \frac{g_{t+1}^\top g_{t+1}}{g_t^\top g_t}$
8:    $p_{t+1} = -g_{t+1} + \beta_{t+1} p_t$
9:    $t = t + 1$
10: end while
11: Return: $w_t$


Observe that the $g_{t+1}$ computed via (3.60c) is nothing but the gradient of
$J$ at $w_{t+1}$. Furthermore, consider the following one dimensional
optimization problem:

$$\min_{\alpha \in \mathbb{R}} \phi_t(\alpha) := J(w_t + \alpha p_t).$$

Differentiating $\phi_t$ with respect to $\alpha$

$$\phi_t'(\alpha) = p_t^\top (A w_t + \alpha A p_t - b) = p_t^\top (g_t + \alpha A p_t).$$

The gradient vanishes if we set
$\alpha = -\frac{g_t^\top p_t}{p_t^\top A p_t}$, which recovers (3.60a). In
other words, every iteration of CG minimizes $J(w)$ along a conjugate
direction $p_t$. Contrast this with gradient descent which minimizes $J(w)$
along the negative gradient direction $-g_t$ at every iteration.


It is natural to ask if this idea of generating conjugate directions and
minimizing the objective function along these directions can be applied to
general convex functions. The main difficulty here is that Theorems 3.17 and
3.18 do not hold. In spite of this, extensions of CG are effective even in this
setting. Basically the update rules for $g_t$ and $p_t$ remain the same, but
the parameters $\alpha_t$ and $\beta_t$ are computed differently. Table 3.2
gives an overview of different extensions. See [NW99, Lue84] for details.


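Table 3.2. Non-Quadratic modifications of Conjugate Gradient Descent

Generic Method: Compute Hessian $K_t := \nabla^2 J(w_t)$ and update
    $\alpha_t$ and $\beta_t$ with
    $\alpha_t = -\frac{g_t^\top p_t}{p_t^\top K_t p_t}$ and
    $\beta_t = \frac{g_{t+1}^\top K_t p_t}{p_t^\top K_t p_t}$.

Fletcher-Reeves: Set $\alpha_t = \operatorname*{argmin}_{\alpha} J(w_t + \alpha p_t)$ and
    $\beta_t = \frac{g_{t+1}^\top g_{t+1}}{g_t^\top g_t}$.

Polak-Ribière: Set $\alpha_t = \operatorname*{argmin}_{\alpha} J(w_t + \alpha p_t)$,
    $y_t = g_{t+1} - g_t$, and
    $\beta_t = \frac{y_t^\top g_{t+1}}{g_t^\top g_t}$.
    In practice, Polak-Ribière tends to be better than Fletcher-Reeves.

Hestenes-Stiefel: Set $\alpha_t = \operatorname*{argmin}_{\alpha} J(w_t + \alpha p_t)$,
    $y_t = g_{t+1} - g_t$, and
    $\beta_t = \frac{y_t^\top g_{t+1}}{y_t^\top p_t}$.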


3.2.6 Higher Order Methods 


Recall the motivation for gradient descent as the minimizer of the quadratic
model

$$Q_t(w) := J(w_t) + \langle \nabla J(w_t), w - w_t \rangle + \frac{1}{2} (w - w_t)^\top (w - w_t).$$


The quadratic penalty in the above equation uniformly penalizes deviation
from $w_t$ in different dimensions. When the function is ill-conditioned one
would intuitively want to penalize deviations in different directions
differently. One way to achieve this is by using the Hessian, which results
in the following second order Taylor approximation:

$$Q_t(w) := J(w_t) + \langle \nabla J(w_t), w - w_t \rangle + \frac{1}{2} (w - w_t)^\top \nabla^2 J(w_t) (w - w_t). \tag{3.70}$$

Of course, this requires that $J$ be twice differentiable. We will also assume
that $J$ is strictly convex and hence its Hessian is positive definite and
invertible. Minimizing $Q_t$ by taking gradients with respect to $w$ and
setting it to zero obtains

$$w - w_t = -\nabla^2 J(w_t)^{-1} \nabla J(w_t). \tag{3.71}$$

Since we are only minimizing a model of the objective function, we perform
a line search along the descent direction (3.71) to compute the stepsize
$\eta_t$, which yields the next iterate:

$$w_{t+1} = w_t - \eta_t \nabla^2 J(w_t)^{-1} \nabla J(w_t). \tag{3.72}$$

Details can be found in Algorithm 3.4.

Algorithm 3.4 Newton's Method

1: Input: Initial point $w_0$, gradient norm tolerance $\epsilon$
2: Set $t = 0$
3: while $\|\nabla J(w_t)\| > \epsilon$ do
4:    Compute $p_t := -\nabla^2 J(w_t)^{-1} \nabla J(w_t)$
5:    Compute $\eta_t = \operatorname*{argmin}_{\eta} J(w_t + \eta p_t)$ e.g., via Algorithm 3.1.
6:    $w_{t+1} = w_t + \eta_t p_t$
7:    $t = t + 1$
8: end while
9: Return: $w_t$
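
A one dimensional sketch illustrates the Newton iteration. For brevity it uses a unit stepsize instead of the line search of Algorithm 3.4, which is a simplifying assumption; the function name `newton_1d` and the example objective are likewise chosen only for illustration.

```python
def newton_1d(dJ, d2J, w0, tol=1e-10, max_iter=50):
    """Newton iterations w <- w - J'(w) / J''(w) with unit stepsize.

    Returns the final iterate and the list of visited iterates.
    """
    w = w0
    path = [w]
    for _ in range(max_iter):
        if abs(dJ(w)) <= tol:
            break
        w = w - dJ(w) / d2J(w)
        path.append(w)
    return w, path


# Hypothetical strictly convex example: J(w) = w^4 + 20 w^2 + w.
dJ = lambda w: 4.0 * w ** 3 + 40.0 * w + 1.0
d2J = lambda w: 12.0 * w ** 2 + 40.0
w_opt, path = newton_1d(dJ, d2J, w0=2.0)
```

Printing `path` shows that once the iterates are close to the minimizer, the number of correct digits roughly doubles in each step.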

Suppose $w^*$ denotes the minimizer of $J(w)$. We say that an algorithm
exhibits quadratic convergence if the sequence of iterates $\{w_t\}$ generated
by the algorithm satisfies:

$$\|w_{t+1} - w^*\| \leq C \|w_t - w^*\|^2 \tag{3.73}$$

for some constant $C > 0$. We now show that Newton's method exhibits
quadratic convergence close to the optimum.


Theorem 3.19 (Quadratic convergence of Newton's Method) Suppose
$J$ is twice differentiable, strongly convex, and the Hessian of $J$ is
bounded and Lipschitz continuous with modulus $M$ in a neighborhood of the
solution $w^*$. Furthermore, assume that $\|\nabla^2 J(w)^{-1}\| \leq N$. The
iterations $w_{t+1} = w_t - \nabla^2 J(w_t)^{-1} \nabla J(w_t)$ converge
quadratically to $w^*$, the minimizer of $J$.


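Proof First notice that, writing $\tau$ for the integration variable to avoid
a clash with the iteration index $t$,

$$\nabla J(w_t) - \nabla J(w^*) = \int_0^1 \nabla^2 J(w_t + \tau (w^* - w_t)) (w_t - w^*) \, d\tau. \tag{3.74}$$

Next using the fact that $\nabla^2 J(w_t)$ is invertible and the gradient
vanishes at the optimum ($\nabla J(w^*) = 0$), write

$$w_{t+1} - w^* = w_t - w^* - \nabla^2 J(w_t)^{-1} \nabla J(w_t)$$

$$= \nabla^2 J(w_t)^{-1} [\nabla^2 J(w_t) (w_t - w^*) - (\nabla J(w_t) - \nabla J(w^*))]. \tag{3.75}$$

Using (3.74) and the Lipschitz continuity of $\nabla^2 J$

$$\|\nabla J(w_t) - \nabla J(w^*) - \nabla^2 J(w_t) (w_t - w^*)\|$$

$$= \left\| \int_0^1 [\nabla^2 J(w_t + \tau (w^* - w_t)) - \nabla^2 J(w_t)] (w_t - w^*) \, d\tau \right\|$$

$$\leq \int_0^1 \|\nabla^2 J(w_t + \tau (w^* - w_t)) - \nabla^2 J(w_t)\| \, \|w_t - w^*\| \, d\tau$$

$$\leq \|w_t - w^*\|^2 \int_0^1 M \tau \, d\tau = \frac{M}{2} \|w_t - w^*\|^2. \tag{3.76}$$

Finally use (3.75) and (3.76) to conclude that

$$\|w_{t+1} - w^*\| \leq \frac{M}{2} \|\nabla^2 J(w_t)^{-1}\| \, \|w_t - w^*\|^2 \leq \frac{NM}{2} \|w_t - w^*\|^2.$$

□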


Newton's method as we described it suffers from two major problems.
First, it applies only to twice differentiable, strictly convex functions.
Second, it involves computing and inverting the $n \times n$ Hessian matrix at
every iteration, thus making it computationally very expensive. Although
Newton's method can be extended to deal with positive semi-definite Hessian
matrices, the computational burden often makes it unsuitable for large
scale applications. In such cases one resorts to Quasi-Newton methods.


3.2.6.1 Quasi-Newton Methods


Unlike Newton's method, which computes the Hessian of the objective function
at every iteration, quasi-Newton methods never compute the Hessian;
they approximate it from past gradients. Since they do not require the
objective function to be twice differentiable, quasi-Newton methods are much
more widely applicable. They are widely regarded as the workhorses of
smooth nonlinear optimization due to their combination of computational
efficiency and good asymptotic convergence. The most popular quasi-Newton
algorithm is BFGS, named after its discoverers Broyden, Fletcher, Goldfarb,
and Shanno. In this section we will describe BFGS and its limited memory
counterpart LBFGS.

Fig. 3.9. The blue solid line depicts the one dimensional convex function
$J(w) = w^4 + 20w^2 + w$. The green dotted-dashed line represents the first
order Taylor approximation to $J(w)$, while the red dashed line represents the
second order Taylor approximation, both evaluated at $w = 2$.

Suppose we are given a smooth (not necessarily strictly) convex objective
function $J : \mathbb{R}^n \to \mathbb{R}$ and a current iterate
$w_t \in \mathbb{R}^n$. Just like Newton's method, BFGS forms a local
quadratic model of the objective function $J$:

$$Q_t(w) := J(w_t) + \langle \nabla J(w_t), w - w_t \rangle + \frac{1}{2} (w - w_t)^\top H_t (w - w_t). \tag{3.77}$$

Unlike Newton's method which uses the Hessian to build its quadratic model
(3.70), BFGS uses the matrix $H_t \succ 0$, which is a positive-definite
estimate of the Hessian. A quasi-Newton direction of descent is found by
minimizing $Q_t(w)$:

$$w - w_t = -H_t^{-1} \nabla J(w_t). \tag{3.78}$$


The stepsize $\eta_t > 0$ is found by a line search obeying the Wolfe
conditions (3.42) and (3.43). The final update is given by

$$w_{t+1} = w_t - \eta_t H_t^{-1} \nabla J(w_t). \tag{3.79}$$

Given $w_{t+1}$ we need to update our quadratic model (3.77) to

$$Q_{t+1}(w) := J(w_{t+1}) + \langle \nabla J(w_{t+1}), w - w_{t+1} \rangle + \frac{1}{2} (w - w_{t+1})^\top H_{t+1} (w - w_{t+1}). \tag{3.80}$$

When updating our model it is reasonable to expect that the gradient of
$Q_{t+1}$ should match the gradient of $J$ at $w_t$ and $w_{t+1}$. Clearly,

$$\nabla Q_{t+1}(w) = \nabla J(w_{t+1}) + H_{t+1} (w - w_{t+1}), \tag{3.81}$$


which implies that $\nabla Q_{t+1}(w_{t+1}) = \nabla J(w_{t+1})$, and hence
our second condition is automatically satisfied. In order to satisfy our
first condition, we require

$$\nabla Q_{t+1}(w_t) = \nabla J(w_{t+1}) + H_{t+1} (w_t - w_{t+1}) = \nabla J(w_t). \tag{3.82}$$

By rearranging, we obtain the so-called secant equation:

$$H_{t+1} s_t = y_t, \tag{3.83}$$

where $s_t := w_{t+1} - w_t$ and $y_t := \nabla J(w_{t+1}) - \nabla J(w_t)$
denote the most recent step along the optimization trajectory in parameter
and gradient space, respectively. Since $H_{t+1}$ is a positive definite
matrix, pre-multiplying the secant equation by $s_t^\top$ yields the curvature
condition

$$s_t^\top y_t > 0. \tag{3.84}$$


If the curvature condition is satisfied, then there are an infinite number
of matrices $H_{t+1}$ which satisfy the secant equation (the secant equation
represents $n$ linear equations, but the symmetric matrix $H_{t+1}$ has
$n(n+1)/2$ degrees of freedom). To resolve this issue we choose the closest
matrix to $H_t$ which satisfies the secant equation. The key insight of BFGS
comes from the observation that the descent direction computation (3.78)
involves the inverse matrix $B_t := H_t^{-1}$. Therefore, we choose a matrix
$B_{t+1} := H_{t+1}^{-1}$ such that it is close to $B_t$ and also satisfies
the secant equation:

$$\min_B \|B - B_t\| \tag{3.85}$$

$$\text{s.t. } B = B^\top \text{ and } B y_t = s_t. \tag{3.86}$$

If the matrix norm $\|\cdot\|$ is appropriately chosen [NW99], then it can be
shown that

$$B_{t+1} = (I - \rho_t s_t y_t^\top) B_t (I - \rho_t y_t s_t^\top) + \rho_t s_t s_t^\top, \tag{3.87}$$

where $\rho_t := (y_t^\top s_t)^{-1}$. In other words, the matrix $B_t$ is
modified via an incremental rank-two update, which is very efficient to
compute, to obtain $B_{t+1}$.

Algorithm 3.5 LBFGS

1: Input: Initial point $w_0$, gradient norm tolerance $\epsilon > 0$
2: Set $t = 0$ and $B_0 = I$
3: while $\|\nabla J(w_t)\| > \epsilon$ do
4:    $p_t = -B_t \nabla J(w_t)$
5:    Find $\eta_t$ that obeys (3.42) and (3.43)
6:    $s_t = \eta_t p_t$
7:    $w_{t+1} = w_t + s_t$
8:    $y_t = \nabla J(w_{t+1}) - \nabla J(w_t)$
9:    if $t = 0$: $B_t := \frac{s_t^\top y_t}{y_t^\top y_t} I$
10:   $\rho_t = (s_t^\top y_t)^{-1}$
11:   $B_{t+1} = (I - \rho_t s_t y_t^\top) B_t (I - \rho_t y_t s_t^\top) + \rho_t s_t s_t^\top$
12:   $t = t + 1$
13: end while
14: Return: $w_t$

There exists an interesting connection between the BFGS update (3.87)
and the Hestenes-Stiefel variant of Conjugate gradient. To see this assume
that an exact line search was used to compute $w_{t+1}$, and therefore
$s_t^\top \nabla J(w_{t+1}) = 0$. Furthermore, assume that $B_t = I$, and use
(3.87) to write

$$p_{t+1} = -B_{t+1} \nabla J(w_{t+1}) = -\nabla J(w_{t+1}) + \frac{y_t^\top \nabla J(w_{t+1})}{y_t^\top s_t} s_t, \tag{3.88}$$

which recovers the Hestenes-Stiefel update (see (3.60e) and Table 3.2).
Limited-memory BFGS (LBFGS) is a variant of BFGS designed for solving
large-scale optimization problems where the $O(d^2)$ cost of storing and
updating $B_t$ would be prohibitively expensive. LBFGS approximates the
quasi-Newton direction (3.78) directly from the last $m$ pairs of $s_t$ and
$y_t$ via a matrix-free approach. This reduces the cost to $O(md)$ space and
time per iteration, with $m$ freely chosen. Details can be found in
Algorithm 3.5.
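
The rank-two update (3.87) is easy to check numerically. The sketch below implements it for dense matrices stored as nested lists and verifies the secant condition $B_{t+1} y_t = s_t$ on a small example. The function name `bfgs_inverse_update` and the vectors are hypothetical, and the dense form shown here is the full-memory BFGS update, not the matrix-free LBFGS recursion.

```python
def bfgs_inverse_update(B, s, y):
    """Return B_next = (I - rho s y^T) B (I - rho y s^T) + rho s s^T,
    where rho = 1 / (y^T s). B is a list of rows; s and y are lists."""
    n = len(s)
    rho = 1.0 / sum(yi * si for yi, si in zip(y, s))
    # V = I - rho * s y^T, so that the update reads V B V^T + rho s s^T.
    V = [[(1.0 if i == j else 0.0) - rho * s[i] * y[j] for j in range(n)]
         for i in range(n)]
    VB = [[sum(V[i][k] * B[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    return [[sum(VB[i][k] * V[j][k] for k in range(n)) + rho * s[i] * s[j]
             for j in range(n)] for i in range(n)]


# Hypothetical step and gradient difference with positive curvature s^T y = 8.
s = [1.0, 2.0]
y = [2.0, 3.0]
B_next = bfgs_inverse_update([[1.0, 0.0], [0.0, 1.0]], s, y)
B_next_y = [sum(B_next[i][j] * y[j] for j in range(2)) for i in range(2)]
```

The updated matrix stays symmetric and maps $y_t$ back to $s_t$, which is exactly the constraint (3.86).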


3.2.6.2 Spectral Gradient Methods


Although spectral gradient methods do not use the Hessian explicitly, they
are motivated by arguments very reminiscent of the Quasi-Newton methods.
Recall the update rule (3.79) and secant equation (3.83). Suppose we want
a very simple matrix which approximates the Hessian. Specifically, we want

$$H_{t+1} = \alpha_{t+1} I \tag{3.89}$$

where $\alpha_{t+1}$ is a scalar and $I$ denotes the identity matrix. Then the
secant equation (3.83) becomes

$$\alpha_{t+1} s_t = y_t. \tag{3.90}$$

In general, the above equation cannot be solved. Therefore we use the
$\alpha_{t+1}$ which minimizes $\|\alpha_{t+1} s_t - y_t\|^2$, which yields
the Barzilai-Borwein (BB) stepsize

$$\alpha_{t+1} = \frac{s_t^\top y_t}{s_t^\top s_t}. \tag{3.91}$$

As it turns out, $\alpha_{t+1}$ lies between the minimum and maximum
eigenvalue of the average Hessian in the direction $s_t$, hence the name
Spectral Gradient method. The parameter update (3.79) is now given by

$$w_{t+1} = w_t - \frac{1}{\alpha_t} \nabla J(w_t). \tag{3.92}$$

A practical implementation uses safeguards to ensure that the stepsize
$\alpha_{t+1}$ is neither too small nor too large. Given
$0 < \alpha_{\min} < \alpha_{\max} < \infty$ we compute

$$\alpha_{t+1} = \min\left(\alpha_{\max}, \max\left(\alpha_{\min}, \frac{s_t^\top y_t}{s_t^\top s_t}\right)\right). \tag{3.93}$$

One of the peculiar features of spectral gradient methods is their use
of a non-monotone line search. In all the algorithms we have seen so far,
the stepsize is chosen such that the objective function $J$ decreases at every
iteration. In contrast, non-monotone line searches employ a parameter
$M \geq 1$ and ensure that the objective function decreases in every $M$
iterations. Of course, setting $M = 1$ results in the usual monotone line
search. Details can be found in Algorithm 3.6.
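
As a rough illustration of (3.91) to (3.93), the sketch below runs the safeguarded Barzilai-Borwein iteration on a small diagonal quadratic. It deliberately omits the non-monotone line search, which is a simplifying assumption that is reasonable only for this toy problem; the function name, the initial value `alpha0`, and the eigenvalues are hypothetical.

```python
import math


def barzilai_borwein(grad, w0, alpha0=2.0, alpha_min=1e-10, alpha_max=1e10,
                     tol=1e-8, max_iter=10000):
    """Gradient steps w <- w - grad(w) / alpha with the safeguarded
    Barzilai-Borwein choice of alpha; returns (w, iterations used)."""
    w = list(w0)
    g = grad(w)
    alpha = alpha0
    for t in range(max_iter):
        if math.sqrt(sum(gi * gi for gi in g)) <= tol:
            return w, t
        w_next = [wi - gi / alpha for wi, gi in zip(w, g)]   # (3.92)
        g_next = grad(w_next)
        s = [a - b for a, b in zip(w_next, w)]
        y = [a - b for a, b in zip(g_next, g)]
        bb = sum(si * yi for si, yi in zip(s, y)) / sum(si * si for si in s)
        alpha = min(alpha_max, max(alpha_min, bb))           # (3.93)
        w, g = w_next, g_next
    return w, max_iter


# Hypothetical quadratic J(w) = 0.5 * sum_i lambda_i * w_i^2.
eigs = [1.0, 10.0, 100.0]
grad = lambda w: [lam * wi for lam, wi in zip(eigs, w)]
w, iters = barzilai_borwein(grad, [1.0, 1.0, 1.0])
```

Each iteration costs one gradient evaluation and a few inner products, the same as plain gradient descent, yet the stepsize adapts to the curvature observed along the most recent step.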


Algorithm 3.6 Spectral Gradient Method

1: Input: $w_0$, $M \geq 1$, $\alpha_{\max} > \alpha_{\min} > 0$, $\gamma \in (0, 1)$, $1 > \sigma_2 > \sigma_1 > 0$,
   $\alpha_0 \in [\alpha_{\min}, \alpha_{\max}]$, and $\epsilon > 0$
2: Initialize: $t = 0$
3: while $\|\nabla J(w_t)\| > \epsilon$ do
4:    $\lambda = 1$
5:    while TRUE do
6:       $d_t = -\frac{1}{\alpha_t} \nabla J(w_t)$
7:       $w_+ = w_t + \lambda d_t$
8:       $\delta = \langle d_t, \nabla J(w_t) \rangle$
9:       if $J(w_+) \leq \max_{0 \leq j \leq \min(t, M-1)} J(w_{t-j}) + \gamma \lambda \delta$ then
10:         $w_{t+1} = w_+$
11:         $s_t = w_{t+1} - w_t$
12:         $y_t = \nabla J(w_{t+1}) - \nabla J(w_t)$
13:         break
14:      else
15:         $\lambda_{tmp} = -\frac{1}{2} \lambda^2 \delta / (J(w_+) - J(w_t) - \lambda \delta)$
16:         if $\lambda_{tmp} > \sigma_1$ and $\lambda_{tmp} < \sigma_2 \lambda$ then
17:            $\lambda = \lambda_{tmp}$
18:         else
19:            $\lambda = \lambda / 2$
20:         end if
21:      end if
22:   end while
23:   $\alpha_{t+1} = \min(\alpha_{\max}, \max(\alpha_{\min}, \frac{s_t^\top y_t}{s_t^\top s_t}))$
24:   $t = t + 1$
25: end while
26: Return: $w_t$


3.2.7 Bundle Methods


The methods we discussed above are applicable for minimizing smooth,
convex objective functions. Some regularized risk minimization problems involve
a non-smooth objective function. In such cases, one needs to use bundle
methods. In order to lay the ground for bundle methods we first describe
their precursor, the cutting plane method [Kel60]. The cutting plane method is
based on a simple observation: A convex function is bounded from below by
its linearization (i.e., first order Taylor approximation). See Figures 3.4 and
3.5 for geometric intuition, and recall (3.7) and (3.13):

$$J(w) \geq J(w') + \langle w - w', s' \rangle \quad \forall w \text{ and } s' \in \partial J(w'). \tag{3.94}$$

Given subgradients $s_1, s_2, \ldots, s_t$ evaluated at locations
$w_0, w_1, \ldots, w_{t-1}$, we can construct a tighter (piecewise linear)
lower bound for $J$ as follows (also see Figure 3.10):

$$J(w) \geq J_t^{CP}(w) := \max_{1 \leq i \leq t} \{J(w_{i-1}) + \langle w - w_{i-1}, s_i \rangle\}. \tag{3.95}$$

Given iterates $\{w_i\}_{i=0}^{t-1}$, the cutting plane method minimizes
$J_t^{CP}$ to obtain the next iterate $w_t$:

$$w_t := \operatorname*{argmin}_{w} J_t^{CP}(w). \tag{3.96}$$

This iteratively refines the piecewise linear lower bound $J^{CP}$ and allows
us to get close to the minimum of $J$ (see Figure 3.10 for an illustration).

If $w^*$ denotes the minimizer of $J$, then clearly each
$J(w_i) \geq J(w^*)$ and hence $\min_{0 \leq i \leq t} J(w_i) \geq J(w^*)$. On
the other hand, since $J \geq J_t^{CP}$ it follows that
$J(w^*) \geq J_t^{CP}(w_t)$. In other words, $J(w^*)$ is sandwiched between
$\min_{0 \leq i \leq t} J(w_i)$ and $J_t^{CP}(w_t)$ (see Figure 3.11 for an
illustration). The cutting plane method monitors the monotonically decreasing
quantity

$$\epsilon_t := \min_{0 \leq i \leq t} J(w_i) - J_t^{CP}(w_t), \tag{3.97}$$

and terminates whenever $\epsilon_t$ falls below a predefined threshold
$\epsilon$. This ensures that the solution $J(w_t)$ is $\epsilon$ optimum,
that is, $J(w_t) \leq J(w^*) + \epsilon$.
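
The cutting plane iteration can be sketched in one dimension, where the piecewise linear model can be minimized by enumeration instead of a linear program. Restricting the search to a bounded interval `[lo, hi]` is an assumption of this sketch (it keeps the first model bounded from below); the function name and the example objective are also hypothetical.

```python
def cutting_plane_1d(J, subgrad, w0, lo, hi, eps=1e-6, max_iter=200):
    """Cutting plane method for a convex J on the interval [lo, hi].

    Each cut is stored as (slope, intercept) of the line
    J(w_i) + s_i * (w - w_i). Returns the best iterate and the final gap.
    """
    cuts = []
    best_w, best_val = w0, J(w0)
    w = w0
    gap = float("inf")
    for _ in range(max_iter):
        s = subgrad(w)
        cuts.append((s, J(w) - s * w))

        def model(x):
            return max(a * x + c for a, c in cuts)

        # In one dimension the piecewise linear model attains its minimum
        # at an interval endpoint or where two cuts intersect.
        candidates = [lo, hi]
        for i in range(len(cuts)):
            for j in range(i + 1, len(cuts)):
                (a1, c1), (a2, c2) = cuts[i], cuts[j]
                if abs(a1 - a2) > 1e-12:
                    x = (c2 - c1) / (a1 - a2)
                    if lo <= x <= hi:
                        candidates.append(x)
        w = min(candidates, key=model)       # next iterate, as in (3.96)
        if J(w) < best_val:
            best_w, best_val = w, J(w)
        gap = best_val - model(w)            # approximation gap, as in (3.97)
        if gap <= eps:
            break
    return best_w, gap


# Hypothetical non-smooth convex objective with minimizer w = 0.5.
J = lambda w: abs(w) + (w - 1.0) ** 2
subgrad = lambda w: (1.0 if w > 0 else -1.0 if w < 0 else 0.0) + 2.0 * (w - 1.0)
w_best, gap = cutting_plane_1d(J, subgrad, w0=3.0, lo=-2.0, hi=3.0)
```

The returned gap is an upper bound on how far the best function value seen so far can be from the optimum on the interval, so it serves directly as the stopping criterion.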


Fig. 3.10. A convex function (blue solid curve) is bounded from below by its
linearizations (dashed lines). The gray area indicates the piecewise linear lower
bound obtained by using the linearizations. We depict a few iterations of the
cutting plane method. At each iteration the piecewise linear lower bound is
minimized and a new linearization is added at the minimizer (red rectangle). As
can be seen, adding more linearizations improves the lower bound.

Fig. 3.11. A convex function (blue solid curve) with four linearizations evaluated at
four different locations (magenta circles). The approximation gap $\epsilon_3$ at the
end of the fourth iteration is indicated by the height of the cyan horizontal band,
i.e., the difference between the lowest value of $J(w)$ evaluated so far and the
minimum of $J_3^{CP}(w)$ (red diamond).

Although the cutting plane method was shown to be convergent [Kel60], it is
well known (see e.g., [LNN95, Bel05]) that it can be very slow when new
iterates move too far away from the previous ones (i.e., causing unstable
“zig-zag” behavior in the iterates). In fact, in the worst case the cutting
plane method might require exponentially many steps to converge to an
$\epsilon$ optimum solution.

Bundle methods stabilize CPM by augmenting the piecewise linear lower bound
(e.g., $J_t^{\mathrm{CP}}(w)$ in (3.95)) with a prox-function (i.e., proximity control
function) which prevents overly large steps in the iterates [Kiw90]. Roughly
speaking, there are 3 popular types of bundle methods, namely, proximal
[Kiw90], trust region [SZ92], and level set [LNN95]. All three versions use
$\frac{1}{2}\|\cdot\|^2$ as their prox-function, but differ in the way they compute the new
iterate:

$$\begin{aligned}
\text{proximal:}\quad & w_t := \operatorname*{argmin}_{w} \Big\{ \tfrac{\zeta_t}{2} \|w - \hat{w}_{t-1}\|^2 + J_t^{\mathrm{CP}}(w) \Big\}, && \text{(3.98)}\\
\text{trust region:}\quad & w_t := \operatorname*{argmin}_{w} \Big\{ J_t^{\mathrm{CP}}(w) \;\Big|\; \tfrac{1}{2} \|w - \hat{w}_{t-1}\|^2 \le \kappa_t \Big\}, && \text{(3.99)}\\
\text{level set:}\quad & w_t := \operatorname*{argmin}_{w} \Big\{ \tfrac{1}{2} \|w - \hat{w}_{t-1}\|^2 \;\Big|\; J_t^{\mathrm{CP}}(w) \le \tau_t \Big\}, && \text{(3.100)}
\end{aligned}$$
where $\hat{w}_{t-1}$ is the current prox-center, and $\zeta_t$, $\kappa_t$, and $\tau_t$ are positive
trade-off parameters of the stabilization. Although (3.98) can be shown to be
equivalent to (3.99) for appropriately chosen $\zeta_t$ and $\kappa_t$, tuning $\zeta_t$ is rather
difficult while a trust region approach can be used for automatically tuning
$\kappa_t$. Consequently the trust region algorithm BT of [SZ92] is widely used in
practice.


3.3 Constrained Optimization 


So far our focus was on unconstrained optimization problems. Many ma- 
chine learning problems involve constraints, and can often be written in the 
following canonical form: 


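$$\begin{aligned}
\min_{w}\quad & J(w) && \text{(3.101a)}\\
\text{s.t.}\quad & c_i(w) \le 0 \ \text{ for } i \in \mathcal{I} && \text{(3.101b)}\\
& e_i(w) = 0 \ \text{ for } i \in \mathcal{E} && \text{(3.101c)}
\end{aligned}$$

where both $c_i$ and $e_i$ are convex functions. We say that $w$ is feasible if and
only if it satisfies the constraints, that is, $c_i(w) \le 0$ for $i \in \mathcal{I}$ and $e_i(w) = 0$
for $i \in \mathcal{E}$.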

Recall that $w$ is the minimizer of an unconstrained problem if and only if
$\|\nabla J(w)\| = 0$ (see Lemma 3.6). Unfortunately, when constraints are present
one cannot use this simple characterization of the solution. For instance, the
$w$ at which $\|\nabla J(w)\| = 0$ may not be a feasible point. To illustrate, consider
the following simple minimization problem (see Figure 3.12):

$$\begin{aligned}
\min_{w}\quad & \tfrac{1}{2} w^2 && \text{(3.102a)}\\
\text{s.t.}\quad & 1 \le w \le 2. && \text{(3.102b)}
\end{aligned}$$

Clearly, $\frac{1}{2} w^2$ is minimized at $w = 0$, but because of the presence of the
constraints, the minimum of (3.102) is attained at $w = 1$ where $\nabla J(w) = w$ is
equal to 1. Therefore, we need other ways to detect convergence. In Section
3.3.1 we discuss some general purpose algorithms based on the concept of
orthogonal projection. In Section 3.3.2 we will discuss Lagrange duality, which
can be used to further characterize the solutions of constrained optimization
problems.


3.3.1 Projection Based Methods 


Suppose we are interested in minimizing a smooth convex function of the 
following form: 


$$\min_{w \in \Omega} J(w), \tag{3.103}$$
where $\Omega$ is a convex feasible region. For instance, $\Omega$ may be described by
convex functions $c_i$ and $e_i$ as in (3.101). The algorithms we describe in this
section are applicable when $\Omega$ is a relatively simple set onto which we can
compute an orthogonal projection. Given a point $w'$ and a feasible region
$\Omega$, the orthogonal projection $P_\Omega(w')$ of $w'$ on $\Omega$ is defined as
$$P_\Omega(w') = \operatorname*{argmin}_{w \in \Omega} \|w' - w\|^2. \tag{3.104}$$
Geometrically speaking, $P_\Omega(w')$ is the closest point to $w'$ in $\Omega$. Of course, if
$w' \in \Omega$ then $P_\Omega(w') = w'$.

Fig. 3.12. The unconstrained minimum of the quadratic function $\frac{1}{2} w^2$ is attained
at $w = 0$ (red circle). But, if we enforce the constraints $1 \le w \le 2$ (illustrated by
the shaded area) then the minimizer is attained at $w = 1$ (green diamond).

We are interested in finding an approximate solution of (3.103), that is,
a $w \in \Omega$ such that
$$J(w) - \min_{w' \in \Omega} J(w') = J(w) - J^* \le \epsilon, \tag{3.105}$$
for some pre-defined tolerance $\epsilon > 0$. Of course, $J^*$ is unknown and hence the
gap $J(w) - J^*$ cannot be computed in practice. Furthermore, as we showed
in Section 3.3, for constrained optimization problems $\|\nabla J(w)\|$ need not
vanish at the optimal solution. Therefore, we will use the following stopping
criterion in our algorithms:
$$\|P_\Omega(w_t - \nabla J(w_t)) - w_t\| \le \epsilon. \tag{3.106}$$

The intuition here is as follows: If $w_t - \nabla J(w_t) \in \Omega$ then
$P_\Omega(w_t - \nabla J(w_t)) = w_t$ if, and only if, $\nabla J(w_t) = 0$, that is, $w_t$ is the global minimizer
of $J(w)$. On the other hand, if $w_t - \nabla J(w_t) \notin \Omega$ but $P_\Omega(w_t - \nabla J(w_t)) = w_t$,
then the constraints are preventing us from making any further progress
along the descent direction $-\nabla J(w_t)$ and hence we should stop.

Algorithm 3.7 Basic Projection Based Method
1: Input: Initial point $w_0 \in \Omega$, and projected gradient norm tolerance $\epsilon > 0$
2: Initialize: $t = 0$
3: while $\|P_\Omega(w_t - \nabla J(w_t)) - w_t\| > \epsilon$ do
4:    Find direction of descent $d_t$
5:    $w_{t+1} = P_\Omega(w_t + \eta_t d_t)$
6:    $t = t + 1$
7: end while
8: Return: $w_t$

The basic projection based method is described in Algorithm 3.7. Any
unconstrained optimization algorithm can be used to generate the direction
of descent $d_t$. A line search is used to find the stepsize $\eta_t$. The updated
parameter $w_t + \eta_t d_t$ is projected onto $\Omega$ to obtain $w_{t+1}$. If $d_t$ is chosen to
be the negative gradient direction $-\nabla J(w_t)$, then the resulting algorithm
is called the projected gradient method. One can show that the rates of
convergence of gradient descent with various line search schemes are also
preserved by projected gradient descent.
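As a concrete illustration, the following Python sketch applies the projected gradient method to the toy problem (3.102), where the feasible region is a box and the projection is a simple clipping operation. The helper names `project_box` and `projected_gradient` are hypothetical, and a fixed stepsize is assumed in place of the line search mentioned above.

```python
import math


def project_box(w, lower, upper):
    """Orthogonal projection onto the box {w : lower <= w <= upper}."""
    return [min(max(wi, lo), hi) for wi, lo, hi in zip(w, lower, upper)]


def projected_gradient(grad, w0, lower, upper, step=0.5, eps=1e-8, max_iter=1000):
    """Projected gradient descent with a fixed stepsize (an assumption of this sketch)."""
    w = project_box(w0, lower, upper)
    for _ in range(max_iter):
        g = grad(w)
        # Stopping criterion (3.106): norm of P(w - grad J(w)) - w.
        trial = project_box([wi - gi for wi, gi in zip(w, g)], lower, upper)
        if math.sqrt(sum((ti - wi) ** 2 for ti, wi in zip(trial, w))) <= eps:
            break
        # Step along the negative gradient, then project back onto the box.
        w = project_box([wi - step * gi for wi, gi in zip(w, g)], lower, upper)
    return w


# Toy problem (3.102): minimize 0.5 * w^2 subject to 1 <= w <= 2.
w_star = projected_gradient(lambda w: [w[0]], [1.8], [1.0], [2.0])
```

Starting from $w_0 = 1.8$ the first step leaves the box and is clipped back to $w = 1$, where the stopping criterion is met even though the gradient equals 1 there.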


3.3.2 Lagrange Duality 


Lagrange duality plays a central role in constrained convex optimization.
The basic idea here is to augment the objective function (3.101) with a
weighted sum of the constraint functions by defining the Lagrangian:
$$L(w, \alpha, \beta) = J(w) + \sum_{i \in \mathcal{I}} \alpha_i c_i(w) + \sum_{i \in \mathcal{E}} \beta_i e_i(w) \tag{3.107}$$
for $\alpha_i \ge 0$ and $\beta_i \in \mathbb{R}$. In the sequel, we will refer to $\alpha$ (respectively $\beta$) as the
Lagrange multipliers associated with the inequality (respectively equality)
constraints. Furthermore, we will call $\alpha$ and $\beta$ dual feasible if and only if
$\alpha_i \ge 0$ and $\beta_i \in \mathbb{R}$. The Lagrangian satisfies the following fundamental
property, which makes it extremely useful for constrained optimization.


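Theorem 3.20 The Lagrangian (3.107) of (3.101) satisfies
$$\max_{\alpha \ge 0, \beta} L(w, \alpha, \beta) = \begin{cases} J(w) & \text{if } w \text{ is feasible} \\ \infty & \text{otherwise.} \end{cases}$$
In particular, if $J^*$ denotes the optimal value of (3.101), then
$$J^* = \min_{w} \max_{\alpha \ge 0, \beta} L(w, \alpha, \beta).$$


Proof First assume that $w$ is feasible, that is, $c_i(w) \le 0$ for $i \in \mathcal{I}$ and
$e_i(w) = 0$ for $i \in \mathcal{E}$. Since $\alpha_i \ge 0$ we have
$$\sum_{i \in \mathcal{I}} \alpha_i c_i(w) + \sum_{i \in \mathcal{E}} \beta_i e_i(w) \le 0, \tag{3.108}$$
with equality being attained by setting $\alpha_i = 0$ whenever $c_i(w) < 0$. Consequently,
$$\max_{\alpha \ge 0, \beta} L(w, \alpha, \beta) = \max_{\alpha \ge 0, \beta} J(w) + \sum_{i \in \mathcal{I}} \alpha_i c_i(w) + \sum_{i \in \mathcal{E}} \beta_i e_i(w) = J(w)$$
whenever $w$ is feasible. On the other hand, if $w$ is not feasible then either
$c_{i'}(w) > 0$ or $e_{i'}(w) \ne 0$ for some $i'$. In the first case simply let $\alpha_{i'} \to \infty$ to
see that $\max_{\alpha \ge 0, \beta} L(w, \alpha, \beta) \to \infty$. Similarly, when $e_{i'}(w) \ne 0$ let $\beta_{i'} \to \infty$
if $e_{i'}(w) > 0$ or $\beta_{i'} \to -\infty$ if $e_{i'}(w) < 0$ to arrive at the same conclusion. $\blacksquare$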


If we define the Lagrange dual function
$$D(\alpha, \beta) = \min_{w} L(w, \alpha, \beta), \tag{3.109}$$
for $\alpha \ge 0$ and $\beta$, then one can prove the following property, which is often
called weak duality.


Theorem 3.21 (Weak Duality) The Lagrange dual function (3.109) satisfies
$$D(\alpha, \beta) \le J(w)$$
for all feasible $w$ and $\alpha \ge 0$ and $\beta$. In particular
$$D^* := \max_{\alpha \ge 0, \beta} \min_{w} L(w, \alpha, \beta) \le \min_{w} \max_{\alpha \ge 0, \beta} L(w, \alpha, \beta) = J^*. \tag{3.110}$$

Proof As before, observe that whenever $w$ is feasible
$$\sum_{i \in \mathcal{I}} \alpha_i c_i(w) + \sum_{i \in \mathcal{E}} \beta_i e_i(w) \le 0.$$
Therefore
$$D(\alpha, \beta) = \min_{w'} L(w', \alpha, \beta) \le J(w) + \sum_{i \in \mathcal{I}} \alpha_i c_i(w) + \sum_{i \in \mathcal{E}} \beta_i e_i(w) \le J(w)$$
for all feasible $w$ and $\alpha \ge 0$ and $\beta$. In particular, one can choose $w$ to be
the minimizer of (3.101) and $\alpha \ge 0$ and $\beta$ to be maximizers of $D(\alpha, \beta)$ to
obtain (3.110). $\blacksquare$
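Weak duality can be checked numerically on the toy problem (3.102). The sketch below writes the constraints as $c_1(w) = 1 - w \le 0$ and $c_2(w) = w - 2 \le 0$, evaluates the dual function (3.109) in closed form, and compares it with the primal objective on a grid. The grid resolution and the function names are arbitrary choices made for illustration.

```python
def primal(w):
    """Objective of the toy problem (3.102)."""
    return 0.5 * w * w


def dual(a1, a2):
    """Dual function D(a1, a2) = min_w 0.5 w^2 + a1 (1 - w) + a2 (w - 2).

    The Lagrangian is a convex quadratic in w, minimized at w = a1 - a2.
    """
    w = a1 - a2
    return 0.5 * w * w + a1 * (1.0 - w) + a2 * (w - 2.0)


multipliers = [i / 10 for i in range(31)]          # grid on [0, 3]
feasible_points = [1.0 + i / 10 for i in range(11)]  # grid on [1, 2]

# Weak duality: every dual value lower-bounds every feasible primal value.
weak_duality_holds = all(
    dual(a1, a2) <= primal(w) + 1e-12
    for a1 in multipliers
    for a2 in multipliers
    for w in feasible_points
)
best_dual = max(dual(a1, a2) for a1 in multipliers for a2 in multipliers)
best_primal = min(primal(w) for w in feasible_points)
```

On this grid the largest dual value is attained at $\alpha_1 = 1$, $\alpha_2 = 0$ and coincides with the smallest primal value $J(1) = \frac{1}{2}$, so for this convex problem the bound is tight.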


Weak duality holds for arbitrary functions, not necessarily convex ones. When
the objective function and constraints are convex, and certain technical
conditions, also known as Slater's conditions, hold, then we can say more.


Theorem 3.22 (Strong Duality) Suppose the objective function $J$ and
constraints $c_i$ for $i \in \mathcal{I}$ and $e_i$ for $i \in \mathcal{E}$ in (3.101) are convex and the
following constraint qualification holds:

There exists a $w$ such that $c_i(w) < 0$ for all $i \in \mathcal{I}$.

Then the Lagrange dual function (3.109) satisfies
$$D^* := \max_{\alpha \ge 0, \beta} \min_{w} L(w, \alpha, \beta) = \min_{w} \max_{\alpha \ge 0, \beta} L(w, \alpha, \beta) = J^*. \tag{3.111}$$
The proof of the above theorem is quite technical and can be found in
any standard reference (e.g., [BV04]). Therefore we will omit the proof and
proceed to discuss various implications of strong duality. First note that
$$\min_{w} \max_{\alpha \ge 0, \beta} L(w, \alpha, \beta) = \max_{\alpha \ge 0, \beta} \min_{w} L(w, \alpha, \beta). \tag{3.112}$$
In other words, one can switch the order of minimization over $w$ with
maximization over $\alpha$ and $\beta$. This is called the saddle point property of convex
functions.

Suppose strong duality holds. Given any $\alpha \ge 0$ and $\beta$ such that $D(\alpha, \beta) > -\infty$
and a feasible $w$ we can immediately bound the duality gap
$$J(w) - J^* = J(w) - D^* \le J(w) - D(\alpha, \beta),$$
where $J^*$ and $D^*$ were defined in (3.111). Below we show that if $w^*$ is primal
optimal and $(\alpha^*, \beta^*)$ are dual optimal then $J(w^*) - D(\alpha^*, \beta^*) = 0$. This
provides a non-heuristic stopping criterion for constrained optimization: stop
when $J(w) - D(\alpha, \beta) \le \epsilon$, where $\epsilon$ is a pre-specified tolerance.




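Suppose the primal and dual optimal values are attained at $w^*$ and
$(\alpha^*, \beta^*)$ respectively, and consider the following line of argument:

$$\begin{aligned}
J(w^*) &= D(\alpha^*, \beta^*) && \text{(3.113a)}\\
&= \min_{w} J(w) + \sum_{i \in \mathcal{I}} \alpha_i^* c_i(w) + \sum_{i \in \mathcal{E}} \beta_i^* e_i(w) && \text{(3.113b)}\\
&\le J(w^*) + \sum_{i \in \mathcal{I}} \alpha_i^* c_i(w^*) + \sum_{i \in \mathcal{E}} \beta_i^* e_i(w^*) && \text{(3.113c)}\\
&\le J(w^*). && \text{(3.113d)}
\end{aligned}$$

To write (3.113a) we used strong duality, while (3.113c) is obtained by setting
$w = w^*$ in (3.113b). Finally, to obtain (3.113d) we used the fact that $w^*$ is
feasible and hence (3.108) holds. Since the first and last terms coincide, every
inequality in (3.113) must hold with equality, and one can conclude that the
following complementary slackness condition holds:
$$\sum_{i \in \mathcal{I}} \alpha_i^* c_i(w^*) + \sum_{i \in \mathcal{E}} \beta_i^* e_i(w^*) = 0.$$
Since $e_i(w^*) = 0$ and every term $\alpha_i^* c_i(w^*)$ is non-positive, this means that
$\alpha_i^* c_i(w^*) = 0$ or equivalently $\alpha_i^* = 0$ whenever $c_i(w^*) < 0$.
Furthermore, since $w^*$ minimizes $L(w, \alpha^*, \beta^*)$ over $w$, it follows that its
gradient must vanish at $w^*$, that is,
$$\nabla J(w^*) + \sum_{i \in \mathcal{I}} \alpha_i^* \nabla c_i(w^*) + \sum_{i \in \mathcal{E}} \beta_i^* \nabla e_i(w^*) = 0.$$
Putting everything together, we obtain

$$\begin{aligned}
c_i(w^*) &\le 0 \quad \forall i \in \mathcal{I} && \text{(3.114a)}\\
e_i(w^*) &= 0 \quad \forall i \in \mathcal{E} && \text{(3.114b)}\\
\alpha_i^* &\ge 0 && \text{(3.114c)}\\
\alpha_i^* c_i(w^*) &= 0 && \text{(3.114d)}\\
\nabla J(w^*) + \sum_{i \in \mathcal{I}} \alpha_i^* \nabla c_i(w^*) + \sum_{i \in \mathcal{E}} \beta_i^* \nabla e_i(w^*) &= 0. && \text{(3.114e)}
\end{aligned}$$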


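The above conditions are called the KKT conditions. If the primal problem is
convex, then the KKT conditions are both necessary and sufficient. In other
words, if $\hat{w}$ and $(\hat{\alpha}, \hat{\beta})$ satisfy (3.114) then $\hat{w}$ and $(\hat{\alpha}, \hat{\beta})$ are primal and dual
optimal with zero duality gap. To see this note that the first two conditions
show that $\hat{w}$ is feasible. Since $\hat{\alpha}_i \ge 0$, $L(w, \hat{\alpha}, \hat{\beta})$ is convex in $w$. Finally the
last condition states that $\hat{w}$ minimizes $L(w, \hat{\alpha}, \hat{\beta})$. Since $\hat{\alpha}_i c_i(\hat{w}) = 0$ and
$e_i(\hat{w}) = 0$, we have
$$D(\hat{\alpha}, \hat{\beta}) = \min_{w} L(w, \hat{\alpha}, \hat{\beta}) = J(\hat{w}) + \sum_{i \in \mathcal{I}} \hat{\alpha}_i c_i(\hat{w}) + \sum_{i \in \mathcal{E}} \hat{\beta}_i e_i(\hat{w}) = J(\hat{w}).$$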


3.3.3 Linear and Quadratic Programs


So far we discussed general constrained optimization problems. Many ma- 
chine learning problems have special structure which can be exploited fur- 
ther. We discuss the implication of duality for two such problems. 


3.3.3.1 Linear Programming


An optimization problem with a linear objective function and (both equality 
and inequality) linear constraints is said to be a linear program (LP). A 
canonical linear program is of the following form: 


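$$\begin{aligned}
\min_{w}\quad & c^\top w && \text{(3.115a)}\\
\text{s.t.}\quad & Aw = b,\ w \ge 0. && \text{(3.115b)}
\end{aligned}$$
Here $w$ and $c$ are $n$ dimensional vectors, while $b$ is an $m$ dimensional vector,
and $A$ is an $m \times n$ matrix with $m < n$.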
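Suppose we are given an LP of the form:

$$\begin{aligned}
\min_{w}\quad & c^\top w && \text{(3.116a)}\\
\text{s.t.}\quad & Aw \ge b, && \text{(3.116b)}
\end{aligned}$$
we can transform it into a canonical LP by introducing non-negative slack
variables $\xi$:
$$\begin{aligned}
\min_{w, \xi}\quad & c^\top w && \text{(3.117a)}\\
\text{s.t.}\quad & Aw - \xi = b,\ \xi \ge 0. && \text{(3.117b)}
\end{aligned}$$

Next, we split $w$ into its positive and negative parts $w^+$ and $w^-$ respectively
by setting $w_i^+ = \max(0, w_i)$ and $w_i^- = \max(0, -w_i)$, so that $w = w^+ - w^-$.
Using these new variables we rewrite (3.117) as

$$\begin{aligned}
\min_{w^+, w^-, \xi}\quad & \begin{bmatrix} c \\ -c \\ 0 \end{bmatrix}^\top \begin{bmatrix} w^+ \\ w^- \\ \xi \end{bmatrix} && \text{(3.118a)}\\
\text{s.t.}\quad & \begin{bmatrix} A & -A & -I \end{bmatrix} \begin{bmatrix} w^+ \\ w^- \\ \xi \end{bmatrix} = b,\quad \begin{bmatrix} w^+ \\ w^- \\ \xi \end{bmatrix} \ge 0, && \text{(3.118b)}
\end{aligned}$$

thus yielding a canonical LP (3.115) in the variables $w^+$, $w^-$ and $\xi$.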
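The transformation from the inequality form (3.116) to the canonical form (3.118) is purely mechanical, as the following Python sketch shows. The function names `to_canonical` and `lift`, as well as the small numerical example, are hypothetical and serve only to illustrate the slack variables and the split into positive and negative parts.

```python
def to_canonical(c, A, b):
    """Rewrite  min c^T w  s.t.  A w >= b  in the canonical form (3.118).

    The new variable is z = (w_plus, w_minus, xi) with z >= 0, and the
    constraint matrix is the block matrix [A, -A, -I].
    """
    m = len(A)
    c_can = list(c) + [-ci for ci in c] + [0.0] * m
    A_can = []
    for i, row in enumerate(A):
        slack = [-1.0 if j == i else 0.0 for j in range(m)]   # row i of -I
        A_can.append(list(row) + [-a for a in row] + slack)
    return c_can, A_can, list(b)


def lift(w, A, b):
    """Map a point w with A w >= b to the canonical variables (w_plus, w_minus, xi)."""
    w_plus = [max(0.0, wi) for wi in w]
    w_minus = [max(0.0, -wi) for wi in w]
    xi = [sum(a * wi for a, wi in zip(row, w)) - bi for row, bi in zip(A, b)]
    return w_plus + w_minus + xi


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical example with n = 2 variables and m = 2 constraints.
c = [1.0, -2.0]
A = [[1.0, 1.0], [1.0, -1.0]]
b = [1.0, -3.0]
w = [2.0, -0.5]                      # satisfies A w >= b

c_can, A_can, b_can = to_canonical(c, A, b)
z = lift(w, A, b)
residuals = [dot(row, z) - bi for row, bi in zip(A_can, b_can)]
```

The lifted point `z` is non-negative, satisfies the equality constraints exactly, and has the same objective value as `w`, which is what makes the two formulations equivalent.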
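By introducing Lagrange multipliers $\beta$ for the equality constraints and
non-negative Lagrange multipliers $\alpha$ for the constraints $w \ge 0$, one can write
the Lagrangian of (3.115) as
$$L(w, \beta, \alpha) = c^\top w - \beta^\top (Aw - b) - \alpha^\top w. \tag{3.119}$$
Because $\beta$ is unconstrained in sign, subtracting rather than adding
$\beta^\top (Aw - b)$ is merely a convention. Taking gradients with respect to the
primal and dual variables and setting them to zero, together with complementary
slackness and feasibility, gives

$$\begin{aligned}
A^\top \beta + \alpha &= c && \text{(3.120a)}\\
Aw &= b && \text{(3.120b)}\\
\alpha^\top w &= 0 && \text{(3.120c)}\\
w &\ge 0 && \text{(3.120d)}\\
\alpha &\ge 0. && \text{(3.120e)}
\end{aligned}$$

Condition (3.120c) can be simplified by noting that both $w$ and $\alpha$ are
constrained to be non-negative, therefore $\alpha^\top w = 0$ if, and only if, $\alpha_i w_i = 0$ for
$i = 1, \ldots, n$.

Using (3.120a), (3.120c), and (3.120b) we can write
$$c^\top w = (A^\top \beta + \alpha)^\top w = \beta^\top A w = \beta^\top b.$$
Substituting this into (3.115) and eliminating the primal variable $w$ yields
the following dual LP

$$\begin{aligned}
\max_{\beta, \alpha}\quad & b^\top \beta && \text{(3.121a)}\\
\text{s.t.}\quad & A^\top \beta + \alpha = c,\ \alpha \ge 0. && \text{(3.121b)}
\end{aligned}$$

As before, we let $\beta^+ = \max(\beta, 0)$ and $\beta^- = \max(0, -\beta)$ and convert the
above LP into the following canonical LP

$$\begin{aligned}
\max_{\beta^+, \beta^-, \alpha}\quad & \begin{bmatrix} b \\ -b \\ 0 \end{bmatrix}^\top \begin{bmatrix} \beta^+ \\ \beta^- \\ \alpha \end{bmatrix} && \text{(3.122a)}\\
\text{s.t.}\quad & \begin{bmatrix} A^\top & -A^\top & I \end{bmatrix} \begin{bmatrix} \beta^+ \\ \beta^- \\ \alpha \end{bmatrix} = c,\quad \begin{bmatrix} \beta^+ \\ \beta^- \\ \alpha \end{bmatrix} \ge 0. && \text{(3.122b)}
\end{aligned}$$

It can be easily verified that the primal-dual problem is symmetric; by taking
the dual of the dual we recover the primal (Problem 3.17). One important
thing to note however is that the primal (3.115) involves $n$ variables and
$n + m$ constraints, while the dual (3.122) involves $2m + n$ variables and
$4m + 2n$ constraints.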


3.3.3.2 Quadratic Programming


An optimization problem with a convex quadratic objective function and lin- 
ear constraints is said to be a convex quadratic program (QP). The canonical 
convex QP can be written as follows: 


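$$\begin{aligned}
\min_{w}\quad & \tfrac{1}{2} w^\top G w + w^\top d && \text{(3.123a)}\\
\text{s.t.}\quad & a_i^\top w = b_i \ \text{ for } i \in \mathcal{E} && \text{(3.123b)}\\
& a_i^\top w \le b_i \ \text{ for } i \in \mathcal{I} && \text{(3.123c)}
\end{aligned}$$

Here $G \succeq 0$ is an $n \times n$ positive semi-definite matrix, $\mathcal{E}$ and $\mathcal{I}$ are finite sets of
indices, while $d$ and $a_i$ are $n$ dimensional vectors, and $b_i$ are scalars.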

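As a warm up let us consider the arguably simpler equality constrained
quadratic programs. In this case, we can stack the $a_i$ into a matrix $A$ and
the $b_i$ into a vector $b$ to write

$$\begin{aligned}
\min_{w}\quad & \tfrac{1}{2} w^\top G w + w^\top d && \text{(3.124a)}\\
\text{s.t.}\quad & Aw = b. && \text{(3.124b)}
\end{aligned}$$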

By introducing Lagrange multipliers $\beta$ (which are unconstrained in sign because
they belong to equality constraints) the Lagrangian of the
above optimization problem can be written as
$$L(w, \beta) = \tfrac{1}{2} w^\top G w + w^\top d + \beta^\top (Aw - b). \tag{3.125}$$

To find the saddle point of the Lagrangian we take gradients with respect
to $w$ and $\beta$ and set them to zero. This obtains
$$\begin{aligned}
Gw + d + A^\top \beta &= 0\\
Aw &= b.
\end{aligned}$$
Putting these two conditions together yields the following linear system of
equations
$$\begin{bmatrix} G & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} w \\ \beta \end{bmatrix} = \begin{bmatrix} -d \\ b \end{bmatrix}. \tag{3.126}$$
The matrix in the above equation is called the KKT matrix, and we can use
it to characterize the conditions under which (3.124) has a unique solution.


Theorem 3.23 Let $Z$ be an $n \times (n - m)$ matrix whose columns form a basis
for the null space of $A$, that is, $AZ = 0$. If $A$ has full row rank, and the
reduced-Hessian matrix $Z^\top G Z$ is positive definite, then there exists a unique
pair $(w^*, \beta^*)$ which solves (3.126). Furthermore, $w^*$ also minimizes (3.124).


Proof Note that a unique $(w^*, \beta^*)$ exists whenever the KKT matrix is
non-singular. Suppose this is not the case, then there exist vectors
$a$ and $b$, not both zero, such that
$$\begin{bmatrix} G & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = 0.$$
Since $Aa = 0$ this implies that $a$ lies in the null space of $A$ and hence there
exists a $u$ such that $a = Zu$. Therefore
$$\begin{bmatrix} Zu \\ b \end{bmatrix}^\top \begin{bmatrix} G & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} Zu \\ b \end{bmatrix} = u^\top Z^\top G Z u = 0.$$
Positive definiteness of $Z^\top G Z$ implies that $u = 0$ and hence $a = 0$. On the
other hand, the full row rank of $A$ and $A^\top b = 0$ implies that $b = 0$. In
summary, both $a$ and $b$ are zero, a contradiction.

Let $w \ne w^*$ be any other feasible point and $\Delta w = w^* - w$. Since $Aw^* =
Aw = b$ we have that $A \Delta w = 0$. Hence, there exists a non-zero $u$ such that
$\Delta w = Zu$. The objective function $J(w)$ can be written as
$$\begin{aligned}
J(w) &= \tfrac{1}{2} (w^* - \Delta w)^\top G (w^* - \Delta w) + (w^* - \Delta w)^\top d\\
&= J(w^*) + \tfrac{1}{2} \Delta w^\top G \Delta w - (Gw^* + d)^\top \Delta w.
\end{aligned}$$
First note that $\frac{1}{2} \Delta w^\top G \Delta w = \frac{1}{2} u^\top Z^\top G Z u > 0$ by positive definiteness of
the reduced Hessian. Second, since $w^*$ solves (3.126) it follows that
$(Gw^* + d)^\top \Delta w = -\beta^{*\top} A \Delta w = 0$. Together these two observations imply that
$J(w) > J(w^*)$. $\blacksquare$

If the technical conditions of the above theorem are met, then solving the
equality constrained QP (3.124) is equivalent to solving the linear system
(3.126). See [NW99] for an extensive discussion of algorithms that can be
used for this task.
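For small problems the KKT system can simply be assembled and handed to a dense linear solver. The Python sketch below does this with a hand-written Gaussian elimination so that it has no dependencies. The function names and the numerical example are hypothetical, and a production code would rather rely on a factorization that exploits the structure of the KKT matrix.

```python
def solve_linear_system(M, rhs):
    """Solve M x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def solve_equality_qp(G, d, A, b):
    """Solve min 0.5 w^T G w + w^T d  s.t.  A w = b  via the KKT system."""
    n, m = len(G), len(A)
    # KKT matrix [[G, A^T], [A, 0]] and right-hand side [-d, b].
    top = [list(G[i]) + [A[j][i] for j in range(m)] for i in range(n)]
    bottom = [list(A[j]) + [0.0] * m for j in range(m)]
    sol = solve_linear_system(top + bottom, [-di for di in d] + list(b))
    return sol[:n], sol[n:]


# Hypothetical example: minimize w1^2 + w2^2 - 2 w1 - 5 w2 subject to w1 + w2 = 1.
w_opt, beta_opt = solve_equality_qp(
    G=[[2.0, 0.0], [0.0, 2.0]], d=[-2.0, -5.0], A=[[1.0, 1.0]], b=[1.0]
)
```

In this example $G$ is positive definite and $A$ has full row rank, so the conditions of Theorem 3.23 hold and the solver returns the unique minimizer together with its multiplier.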

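Next we turn our attention to the general QP (3.123) which also contains
inequality constraints. The Lagrangian in this case can be written as
$$L(w, \alpha, \beta) = \tfrac{1}{2} w^\top G w + w^\top d + \sum_{i \in \mathcal{I}} \alpha_i (a_i^\top w - b_i) + \sum_{i \in \mathcal{E}} \beta_i (a_i^\top w - b_i). \tag{3.127}$$
Let $w^*$ denote the minimizer of (3.123). If we define the active set $\mathcal{A}(w^*)$ as
$$\mathcal{A}(w^*) = \{ i \text{ s.t. } i \in \mathcal{I} \text{ and } a_i^\top w^* = b_i \},$$
then the KKT conditions (3.114) for this problem can be written as

$$\begin{aligned}
a_i^\top w^* - b_i &< 0 \quad \forall i \in \mathcal{I} \setminus \mathcal{A}(w^*) && \text{(3.128a)}\\
a_i^\top w^* - b_i &= 0 \quad \forall i \in \mathcal{E} \cup \mathcal{A}(w^*) && \text{(3.128b)}\\
\alpha_i^* &\ge 0 \quad \forall i \in \mathcal{A}(w^*) && \text{(3.128c)}\\
Gw^* + d + \sum_{i \in \mathcal{A}(w^*)} \alpha_i^* a_i + \sum_{i \in \mathcal{E}} \beta_i a_i &= 0. && \text{(3.128d)}
\end{aligned}$$

Conceptually the main difficulty in solving (3.123) is in identifying the active
set $\mathcal{A}(w^*)$. This is because $\alpha_i^* = 0$ for all $i \in \mathcal{I} \setminus \mathcal{A}(w^*)$. Most algorithms
for solving (3.123) can be viewed as different ways to identify the active set.
See [NW99] for a detailed discussion.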


3.4 Stochastic Optimization 


Recall that regularized risk minimization involves a data-driven optimization 
problem in which the objective function involves the summation of loss terms 
over a set of data to be modeled: 


$$\min_{f} J(f) = \lambda \Omega(f) + \sum_{i=1}^{m} l(f(x_i), y_i).$$
Classical optimization techniques must compute this sum in its entirety for
each evaluation of the objective or of its gradient. As available data
sets grow ever larger, such "batch" optimizers therefore become increasingly
inefficient. They are also ill-suited for the incremental setting, where partial
data must be modeled as it arrives.

Stochastic gradient-based methods, by contrast, work with gradient estimates
obtained from small subsamples (mini-batches) of training data. This
can greatly reduce computational requirements: on large, redundant data
sets, simple stochastic gradient descent routinely outperforms sophisticated
second-order batch methods by orders of magnitude.

The key idea here is that $J(w)$ is replaced by an instantaneous estimate
$J_t$ which is computed from a mini-batch of size $k$ comprising a subset of
points $(x_t^i, y_t^i)$ with $i = 1, \ldots, k$ drawn from the dataset:
$$J_t(w) = \lambda \Omega(w) + \frac{1}{k} \sum_{i=1}^{k} l(w, x_t^i, y_t^i). \tag{3.129}$$
Setting $k = 1$ obtains an algorithm which processes data points as they
arrive.


3.4.1 Stochastic Gradient Descent 


Perhaps the simplest stochastic optimization algorithm is Stochastic Gradient
Descent (SGD). The parameter update of SGD takes the form:
$$w_{t+1} = w_t - \eta_t \nabla J_t(w_t). \tag{3.130}$$
If $J_t$ is not differentiable, then one can choose an arbitrary subgradient from
$\partial J_t(w_t)$ to compute the update. It has been shown that SGD asymptotically
converges to the true minimizer of $J(w)$ if the stepsize $\eta_t$ decays as $O(1/\sqrt{t})$.
For instance, one could set
$$\eta_t = \sqrt{\frac{\tau}{\tau + t}}, \tag{3.131}$$
where $\tau > 0$ is a tuning parameter. See Algorithm 3.8 for details.


3.4.1.1 Practical Considerations


One simple yet effective rule of thumb to tune $\tau$ is to select a small subset
of data, try various values of $\tau$ on this subset, and choose the $\tau$ that most
reduces the objective function.
In some cases letting $\eta_t$ decay as $O(1/t)$ has been found to be more
effective:
$$\eta_t = \frac{\tau}{\tau + t}. \tag{3.132}$$
The free parameter $\tau > 0$ can be tuned as described above. If $\Omega(w)$ is
$\sigma$-strongly convex, then dividing the stepsize $\eta_t$ by $\sigma \lambda$ yields good practical
performance.

Algorithm 3.8 Stochastic Gradient Descent
1: Input: Maximum iterations $T$, batch size $k$, and $\tau$
2: Set $t = 0$ and $w_0 = 0$
3: while $t < T$ do
4:    Choose a subset of $k$ data points $(x_t^i, y_t^i)$ and compute $\nabla J_t(w_t)$
5:    Compute stepsize $\eta_t = \sqrt{\frac{\tau}{\tau + t}}$
6:    $w_{t+1} = w_t - \eta_t \nabla J_t(w_t)$
7:    $t = t + 1$
8: end while
9: Return: $w_T$
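The listing above translates almost line by line into Python. The sketch below assumes a scalar parameter, the regularizer $\Omega(w) = \frac{1}{2} w^2$, a stepsize of the form $\sqrt{\tau / (\tau + t)}$ which decays as $O(1/\sqrt{t})$, and noise-free synthetic data generated from $y = 3x$. All of these choices, as well as the function name `sgd`, are made purely for illustration.

```python
import math
import random


def sgd(data, grad_loss, lam=0.0, k=4, tau=1.0, T=500, seed=0):
    """Mini-batch SGD for J(w) = lam * 0.5 * w^2 + average loss, with scalar w."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(T):
        batch = rng.sample(data, k)          # k points drawn from the dataset
        # Gradient of the instantaneous objective J_t, cf. (3.129).
        g = lam * w + sum(grad_loss(w, x, y) for x, y in batch) / k
        eta = math.sqrt(tau / (tau + t))     # stepsize decaying as O(1/sqrt(t))
        w -= eta * g
    return w


# Synthetic least-squares data: y = 3 x with x drawn uniformly from [-1, 1].
data_rng = random.Random(1)
data = [(x, 3.0 * x) for x in (data_rng.uniform(-1.0, 1.0) for _ in range(200))]

# Squared loss 0.5 * (w x - y)^2 has gradient (w x - y) x with respect to w.
w_hat = sgd(data, lambda w, x, y: (w * x - y) * x)
```

With `lam=0.0` the minimizer of the full objective is $w = 3$, which gives a simple way to check that the iterates end up close to the true solution even though each step only looks at four data points.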


3.5 Nonconvex Optimization 


Our focus in the previous sections was on convex objective functions. Some- 
times non-convex objective functions also arise in machine learning applica- 
tions. These problems are significantly harder and tools for minimizing such 
objective functions are not as well developed. We briefly describe one algo- 
rithm which can be applied whenever we can write the objective function as 
a difference of two convex functions. 


3.5.1 Concave-Convex Procedure 


Any function with a bounded Hessian can be decomposed into the difference
of two (non-unique) convex functions, that is, one can write
$$J(w) = f(w) - g(w), \tag{3.133}$$
where $f$ and $g$ are convex functions. Clearly, $J$ need not be convex, but there
exists a reasonably simple algorithm, namely the Concave-Convex Procedure
(CCP), for finding a local minimum of $J$. The basic idea is simple: In the
$t$-th iteration replace $g$ by its first order Taylor expansion at $w_t$, that is,
$g(w_t) + \langle w - w_t, \nabla g(w_t) \rangle$, and minimize
$$J_t(w) = f(w) - g(w_t) - \langle w - w_t, \nabla g(w_t) \rangle. \tag{3.134}$$
Taking gradients and setting them to zero shows that $J_t$ is minimized by setting
$$\nabla f(w_{t+1}) = \nabla g(w_t). \tag{3.135}$$
The iterations of CCP on a toy minimization problem are illustrated in Figure
3.13, while the complete algorithm listing can be found in Algorithm 3.9.

Fig. 3.13. Given the function on the left we decompose it into the difference of two
convex functions depicted on the right panel. The CCP algorithm generates iterates
by matching points on the two convex curves which have the same tangent vectors.
As can be seen, the iterates approach the solution x = 2.0.


Algorithm 3.9 Concave-Convex Procedure
1: Input: Initial point $w_0$, maximum iterations $T$, convex functions $f$, $g$
2: Set $t = 0$
3: while $t < T$ do
4:    Set $w_{t+1} = \operatorname*{argmin}_{w} f(w) - g(w_t) - \langle w - w_t, \nabla g(w_t) \rangle$
5:    $t = t + 1$
6: end while
7: Return: $w_T$


Theorem 3.24 Let $J$ be a function which can be decomposed into a difference
of two convex functions, e.g., (3.133). The iterates generated by (3.135)
monotonically decrease $J$. Furthermore, the stationary point of the iterates is
a local minimum of $J$.


Proof Since $f$ and $g$ are convex
$$\begin{aligned}
f(w_t) &\ge f(w_{t+1}) + \langle w_t - w_{t+1}, \nabla f(w_{t+1}) \rangle\\
g(w_{t+1}) &\ge g(w_t) + \langle w_{t+1} - w_t, \nabla g(w_t) \rangle.
\end{aligned}$$
Adding the two inequalities, rearranging, and using (3.135) shows that
$J(w_t) = f(w_t) - g(w_t) \ge f(w_{t+1}) - g(w_{t+1}) = J(w_{t+1})$, as claimed.

Let $w^*$ be a stationary point of the iterates. Then $\nabla f(w^*) = \nabla g(w^*)$,
which in turn implies that $w^*$ is a local minimum of $J$ because $\nabla J(w^*) = 0$.
$\blacksquare$
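A one dimensional example makes the update (3.135) concrete. The sketch below uses the hypothetical decomposition $J(w) = w^4 - 2w^2$ with $f(w) = w^4$ and $g(w) = 2w^2$, for which the condition $\nabla f(w_{t+1}) = \nabla g(w_t)$ reads $4 w_{t+1}^3 = 4 w_t$ and can be solved in closed form. The starting point and the number of iterations are arbitrary.

```python
import math


def ccp(w0, T=30):
    """CCP for J(w) = f(w) - g(w) with f(w) = w**4 and g(w) = 2 * w**2."""

    def J(w):
        return w ** 4 - 2.0 * w ** 2

    w, history = w0, [J(w0)]
    for _ in range(T):
        # Solve grad f(w_next) = grad g(w), i.e. 4 * w_next**3 = 4 * w,
        # by taking a sign-preserving cube root.
        w = math.copysign(abs(w) ** (1.0 / 3.0), w)
        history.append(J(w))
    return w, history


w_final, values = ccp(2.0)
# Monotone decrease of J along the iterates, up to floating point noise.
monotone = all(b <= a + 1e-12 for a, b in zip(values, values[1:]))
```

Started at $w_0 = 2$, the iterates decrease towards $w = 1$, one of the two local minima of $J$, and the recorded objective values never increase, in line with Theorem 3.24.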


There are a number of extensions to CCP. We mention only a few in
passing. First, it can be shown that all instances of the EM algorithm
(Section ??) are special cases of CCP. Second, the rate of convergence
of CCP is related to the eigenvalues of the positive semi-definite
matrix $\nabla^2 (f + g)$. Third, CCP can also be extended to solve constrained
problems of the form:

$$\begin{aligned}
\min_{w}\quad & f_0(w) - g_0(w)\\
\text{s.t.}\quad & f_i(w) - g_i(w) \le c_i \ \text{ for } i = 1, \ldots, n,
\end{aligned}$$

where, as before, $f_i$ and $g_i$ for $i = 0, 1, \ldots, n$ are assumed convex. At every
iteration, we replace $g_i$ by its first order Taylor approximation and solve the
following constrained convex problem:

$$\begin{aligned}
\min_{w}\quad & f_0(w) - g_0(w_t) - \langle w - w_t, \nabla g_0(w_t) \rangle\\
\text{s.t.}\quad & f_i(w) - g_i(w_t) - \langle w - w_t, \nabla g_i(w_t) \rangle \le c_i \ \text{ for } i = 1, \ldots, n.
\end{aligned}$$


3.6 Some Practical Advice 


The range of optimization algorithms we presented in this chapter might be
somewhat intimidating for the beginner. Some simple rules of thumb can
alleviate this anxiety.


Code Reuse: Implementing an efficient optimization algorithm correctly
is both time consuming and error prone. Therefore, as far as possible use
existing libraries. A number of high-quality optimization libraries, both
commercial and open source, exist.


Unconstrained Problems: For unconstrained minimization of a smooth
convex function LBFGS (Section 3.2.6.1) is the algorithm of choice. In many
practical situations the spectral gradient method (Section 3.2.6.2) is also
very competitive. It also has the added advantage of being easy to implement.
If the function to be minimized is non-smooth then Bundle methods
(Section 3.2.7) are to be preferred. Amongst the different formulations, the
Bundle Trust algorithm tends to be quite robust.


Constrained Problems: For constrained problems it is very important
to understand the nature of the constraints. Simple equality ($Ax = b$) and
box ($l \le x \le u$) constraints are easier to handle than general non-linear
constraints. If the objective function is smooth, the constraint set $\Omega$ is simple,
and orthogonal projections $P_\Omega$ are easy to compute, then spectral projected
gradient (Section 3.3.1) is the method of choice. If the optimization problem
is a QP or an LP then specialized solvers tend to be much faster than general
purpose solvers.


Large Scale Problems: If your parameter vector is high dimensional then
consider coordinate descent (Section 3.2.2), especially if the one dimensional
line search along a coordinate can be carried out efficiently. If the objective
function is made up of a summation of a large number of terms, consider
stochastic gradient descent (Section 3.4.1). Although neither of these algorithms
guarantees a very accurate solution, practical experience shows that
for large scale machine learning problems this is rarely necessary.


Duality: Sometimes problems which are hard to optimize in the primal 
may become simpler in the dual. For instance, if the objective function is 
strongly convex but non-smooth, its Fenchel conjugate is smooth with a 
Lipschitz continuous gradient. 


Problems 


Problem 3.1 (Intersection of Convex Sets {1}) If $C_1$ and $C_2$ are convex
sets, then show that $C_1 \cap C_2$ is also convex. Extend your result to show
that $\bigcap_{i=1}^{n} C_i$ is convex if the $C_i$ are convex.


Problem 3.2 (Linear Transform of Convex Sets {1}) Given a set $C \subset
\mathbb{R}^n$ and a linear transform $A \in \mathbb{R}^{m \times n}$, define $AC := \{ y = Ax : x \in C \}$. If
$C$ is convex then show that $AC$ is also convex.


Problem 3.3 (Convex Combinations {1}) Show that a subset of $\mathbb{R}^n$ is
convex if and only if it contains all the convex combinations of its elements.


Problem 3.4 (Convex Hull {2}) Show that the convex hull, conv(X), is
the smallest convex set which contains X.


Problem 3.5 (Epigraph of a Convex Function {2}) Show that a function
satisfies Definition 3.3 if, and only if, its epigraph is convex.


Problem 3.6 Prove Jensen's inequality (3.6).

Problem 3.7 (Strong convexity of the negative entropy {3}) Show that
the negative entropy (3.15) is 1-strongly convex with respect to the $\|\cdot\|_1$ norm
on the simplex. Hint: First show that $\phi(t) := (t - 1) \log t - 2 \frac{(t - 1)^2}{t + 1} \ge 0$ for
all $t \ge 0$. Next substitute $t = x_i / y_i$ to show that
$$\sum_{i} (x_i - y_i) \log \frac{x_i}{y_i} \ge \|x - y\|_1^2.$$


Problem 3.8 (Strongly Convex Functions {2}) Prove 3.16, 3.17, 3.18
and 3.19.


Problem 3.9 (Convex Functions with Lipschitz Continuous Gradient {2})
Prove 3.22, 3.23, 3.24 and 3.25.


Problem 3.10 (One Dimensional Projection {1}) If $f : \mathbb{R}^d \to \mathbb{R}$ is
convex, then show that for an arbitrary $x$ and $p$ in $\mathbb{R}^d$ the one dimensional
function $\Phi(\eta) := f(x + \eta p)$ is also convex.


Problem 3.11 (Quasi-Convex Functions {2}) In Section 3.1 we showed
that the below-sets of a convex function $X_c := \{ x \mid f(x) \le c \}$ are convex. Give
a counter-example to show that the converse is not true, that is, there exist
non-convex functions whose below-sets are convex. This class of functions is
called Quasi-Convex.


Problem 3.12 (Gradient of the p-norm {1}) Show that the gradient of 
the p-norm (3.31) is given by (3.32). 


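Problem 3.13 Derive the Fenchel conjugate of the following functions

$$f(x) = \begin{cases} 0 & \text{if } x \in C \\ \infty & \text{otherwise,} \end{cases} \quad \text{where } C \text{ is a convex set}$$

$f(x) = ax + b$

$f(x) = \frac{1}{2} x^\top A x$ where $A$ is a positive definite matrix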
f(x) = es ) 

f(x) = ee ) 

f(x) = xlog(x) 


Problem 3.14 (Convergence of gradient descent {2}) Suppose $J$ has
a Lipschitz continuous gradient with modulus $L$. Then show that Algorithm
3.2 with an inexact line search satisfying the Wolfe conditions (3.42) and
(3.43) will return a solution $w_t$ with $\|\nabla J(w_t)\| \le \epsilon$ in at most $O(1/\epsilon^2)$
iterations.


Problem 3.15 Show that

Problem 3.16 (Coordinate Descent for Quadratic Programming {2})
Derive a projection based method which uses coordinate descent to generate
directions of descent for solving the following box constrained QP:
$$\begin{aligned}
\min_{w}\quad & \tfrac{1}{2} w^\top Q w + c^\top w\\
\text{s.t.}\quad & l \le w \le u.
\end{aligned}$$
You may assume that $Q$ is positive definite and $l$ and $u$ are scalars.

Problem 3.17 (Dual of a LP {1}) Show that the dual of the LP (3.122)
is (3.115). In other words, we recover the primal by computing the dual of
the dual.


4


Online Learning and Boosting 


So far the learning algorithms we considered assumed that all the training 
data is available before building a model for predicting labels on unseen data 
points. In many modern applications data is available only in a streaming 
fashion, and one needs to predict labels on the fly. To describe a concrete 
example, consider the task of spam filtering. As emails arrive the learning 
algorithm needs to classify them as spam or ham. Tasks such as these are 
tackled via online learning. Online learning proceeds in rounds. At each 
round a training example is revealed to the learning algorithm, which uses 
its current model to predict the label. The true label is then revealed to 
the learner which incurs a loss and updates its model based on the feedback 
provided. This protocol is summarized in Algorithm 4.1. The goal of online 
learning is to minimize the total loss incurred. By an appropriate choice 
of labels and loss functions, this setting encompasses a large number of 
tasks such as classification, regression, and density estimation. In our spam 
detection example, if an email is misclassified the user can provide feedback 
which is used to update the spam filter, and the goal is to minimize the 
number of misclassified emails. 


4.1 Halving Algorithm 


The halving algorithm is conceptually simple, yet it illustrates many of the
concepts in online learning. Suppose we have access to a set of $n$ experts,
that is, functions $f_i$ which map from the input space $X$ to the output space
$Y = \{\pm 1\}$. Furthermore, assume that one of the experts is consistent, that
is, there exists a $j \in \{1, \ldots, n\}$ such that $f_j(x_t) = y_t$ for $t = 1, \ldots, T$. The
halving algorithm maintains a set $C_t$ of consistent experts at time $t$. Initially
$C_1 = \{1, \ldots, n\}$, and it is updated recursively as
$$C_{t+1} = \{ i \in C_t \text{ s.t. } f_i(x_t) = y_t \}. \tag{4.1}$$
The prediction on a new data point is computed via a majority vote amongst
the consistent experts: $\hat{y}_t = \text{majority}(C_t)$.


Algorithm 4.1 Protocol of Online Learning
1: for $t = 1, \ldots, T$ do
2:    Get training instance $x_t$
3:    Predict label $\hat{y}_t$
4:    Get true label $y_t$
5:    Incur loss $l(\hat{y}_t, x_t, y_t)$
6:    Update model
7: end for


Lemma 4.1 The Halving algorithm makes at most $\log_2(n)$ mistakes.


Proof Let $M$ denote the total number of mistakes. The halving algorithm
makes a mistake at iteration $t$ if at least half the consistent experts $C_t$ predict
the wrong label. This in turn implies that $|C_{t+1}| \le |C_t| / 2$. Since the set of
consistent experts never grows, after $M$ mistakes we have
$$|C_{t+1}| \le \frac{|C_1|}{2^M} = \frac{n}{2^M}.$$
On the other hand, since one of the experts is consistent it follows that
$1 \le |C_{t+1}|$. Therefore, $2^M \le n$. Solving for $M$ completes the proof. $\blacksquare$
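The halving algorithm is short enough to be written out in full. The following Python sketch uses a hypothetical family of eight threshold experts on integer inputs, one of which is consistent with the labels, and counts the mistakes made on a small stream. Ties in the majority vote are broken in favor of $+1$, a choice the text leaves open.

```python
import math


def halving(experts, stream):
    """Run the halving algorithm on a stream of (x, y) pairs; return the mistake count."""
    consistent = list(range(len(experts)))
    mistakes = 0
    for x, y in stream:
        votes = sum(experts[i](x) for i in consistent)
        y_hat = 1 if votes >= 0 else -1       # majority vote, ties go to +1
        if y_hat != y:
            mistakes += 1
        # Keep only the experts that agree with the revealed label.
        consistent = [i for i in consistent if experts[i](x) == y]
    return mistakes


# Expert th predicts +1 if and only if x >= th; the expert with th = 5 is consistent.
experts = [lambda x, th=th: 1 if x >= th else -1 for th in range(8)]
stream = [(x, 1 if x >= 5 else -1) for x in [4, 5, 0, 7, 3, 6]]

mistakes = halving(experts, stream)
bound = math.log2(len(experts))
```

On this stream the algorithm errs on the first two examples, each time discarding more than half of the remaining experts, and is correct afterwards, so the mistake count stays below the bound $\log_2(8) = 3$ of Lemma 4.1.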


4.2 Weighted Majority 


We now turn to the scenario where none of the experts is consistent.
Therefore, the aim here is not to minimize the number of mistakes but to minimize
regret.

In this chapter we will consider online methods for solving the following
optimization problem:
$$\min_{w \in \Omega} J(w) \quad \text{where} \quad J(w) = \sum_{t=1}^{T} f_t(w). \tag{4.2}$$
Suppose we have access to a function $\psi$ which is continuously differentiable
and strongly convex with modulus of strong convexity $\sigma > 0$ (see Section
3.1.4 for definition of strong convexity), then we can define the Bregman
divergence (3.29) corresponding to $\psi$ as
$$\Delta_\psi(w, w') = \psi(w) - \psi(w') - \langle w - w', \nabla \psi(w') \rangle.$$


We can also generalize the orthogonal projection (3.104) by replacing the 
square Euclidean norm with the above Bregman divergence: 


Pyo(w’) = argmin Ay (w, w’). (4.3) 
wen 
Algorithm 4.2 Stochastic (sub)gradient Descent
1: Input: Initial point $w_1$, maximum iterations $T$
2: for $t = 1, \ldots, T$ do
3:   Compute $\hat{w}_{t+1} = \nabla\psi^*(\nabla\psi(w_t) - \eta_t g_t)$ with $g_t = \partial_w f_t(w_t)$
4:   Set $w_{t+1} = P_{\psi,\Omega}(\hat{w}_{t+1})$
5: end for
6: Return: $w_{T+1}$


Denote $w^* = P_{\psi,\Omega}(w')$. Just like the Euclidean distance is non-expansive, the
Bregman projection can also be shown to be non-expansive in the following
sense:

$$\Delta_\psi(w, w') \geq \Delta_\psi(w, w^*) + \Delta_\psi(w^*, w') \tag{4.4}$$

for all $w \in \Omega$. The diameter of $\Omega$ as measured by $\Delta_\psi$ is given by

$$\mathrm{diam}_\psi(\Omega) = \max_{w, w' \in \Omega} \Delta_\psi(w, w'). \tag{4.5}$$


For the rest of this chapter we will make the following standard assumptions: 


• Each $f_t$ is convex and revealed at time instance $t$.
• $\Omega$ is a closed convex subset of $\mathbb{R}^n$ with non-empty interior.
• The diameter $\mathrm{diam}_\psi(\Omega)$ of $\Omega$ is bounded by $F < \infty$.
• The set of optimal solutions of (4.2) denoted by $\Omega^*$ is non-empty.
• The subgradient $\partial_w f_t(w)$ can be computed for every $t$ and $w \in \Omega$.
• The Bregman projection (4.3) can be computed for every $w' \in \mathbb{R}^n$.
• The gradient $\nabla\psi$ and its inverse $(\nabla\psi)^{-1} = \nabla\psi^*$ can be computed.


The method we employ to solve (4.2) is given in Algorithm 4.2. Before
analyzing the performance of the algorithm we note three special cases:
first, the squared Euclidean distance, which recovers projected stochastic
gradient descent; second, the entropy, which recovers exponentiated gradient
descent; and third, the p-norms for $p \geq 2$, which recover the p-norm
Perceptron.
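
To make the first two special cases concrete, the following sketch writes out one iteration of Algorithm 4.2 for each. It assumes, for illustration only, that $\Omega$ is the unit Euclidean ball in the first case and the probability simplex in the second, that $\psi(w) = \sum_i w_i \log w_i - w_i$ is the entropy used, and that the losses are linear; the function names are hypothetical.

```python
import numpy as np


def project_l2_ball(w, radius=1.0):
    """Euclidean projection onto {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)


def sgd_euclidean_step(w, g, eta, radius=1.0):
    """psi(w) = 0.5 ||w||^2: grad psi and grad psi* are both the identity."""
    w_hat = w - eta * g                       # step 3 of Algorithm 4.2
    return project_l2_ball(w_hat, radius)     # step 4: Euclidean projection


def sgd_entropy_step(w, g, eta):
    """psi(w) = sum_i w_i log w_i - w_i on the probability simplex.

    Here grad psi(w) = log w and grad psi*(u) = exp(u), so step 3 becomes a
    multiplicative update; the Bregman projection onto the simplex for this
    psi is a renormalisation.
    """
    w_hat = w * np.exp(-eta * g)              # exp(log w - eta g)
    return w_hat / w_hat.sum()


# Hypothetical linear losses f_t(w) = <c_t, w>, whose gradient is c_t.
rng = np.random.default_rng(0)
w_euc = np.zeros(3)
w_ent = np.full(3, 1.0 / 3.0)
for t in range(1, 201):
    c = rng.normal(size=3) + np.array([1.0, 0.0, -1.0])
    eta = 1.0 / np.sqrt(t)
    w_euc = sgd_euclidean_step(w_euc, c, eta)
    w_ent = sgd_entropy_step(w_ent, c, eta)
```

Both iterates stay feasible by construction: the Euclidean iterate never leaves the ball, and the entropy iterate remains a probability vector that concentrates on the coordinate with the smallest average cost.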

Our key result is Lemma 4.3 given below. It can be found in various guises 
in different places most notably Lemma 2.1 and 2.2 in [?], Theorem 4.1 and 
Eq. (4.21) and (4.15) in [?], in the proof of Theorem 1 of [?], as well as Lemma 
3 of [?]. We prove a slightly more general variant; we allow for projections with
an arbitrary Bregman divergence and also take into account a generalized 
version of strong convexity of f;. Both these modifications will allow us to 
deal with general settings within a unified framework. 
Definition 4.2 We say that a convex function $f$ is strongly convex with
respect to another convex function $\psi$ with modulus $\lambda$ if

$$f(w) - f(w') - \langle w - w', \mu \rangle \geq \lambda \Delta_\psi(w, w') \text{ for all } \mu \in \partial f(w'). \tag{4.6}$$

The usual notion of strong convexity is recovered by setting $\psi(\cdot) = \frac{1}{2}\|\cdot\|^2$.


Lemma 4.3 Let $f_t$ be strongly convex with respect to $\psi$ with modulus $\lambda \geq 0$
for all $t$. For any $w \in \Omega$ the sequences generated by Algorithm 4.2 satisfy

$$\Delta_\psi(w, w_{t+1}) \leq \Delta_\psi(w, w_t) - \eta_t \langle g_t, w_t - w \rangle + \frac{\eta_t^2}{2\sigma}\|g_t\|^2 \tag{4.7}$$

$$\leq (1 - \eta_t\lambda)\Delta_\psi(w, w_t) - \eta_t(f_t(w_t) - f_t(w)) + \frac{\eta_t^2}{2\sigma}\|g_t\|^2. \tag{4.8}$$


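Proof We prove the result in three steps. First we upper bound $\Delta_\psi(w, w_{t+1})$
by $\Delta_\psi(w, \hat{w}_{t+1})$. This is a consequence of (4.4) and the non-negativity of the
Bregman divergence which allows us to write

$$\Delta_\psi(w, w_{t+1}) \leq \Delta_\psi(w, \hat{w}_{t+1}). \tag{4.9}$$

In the next step we use Lemma 3.11 to write

$$\Delta_\psi(w, w_t) + \Delta_\psi(w_t, \hat{w}_{t+1}) - \Delta_\psi(w, \hat{w}_{t+1}) = \langle \nabla\psi(\hat{w}_{t+1}) - \nabla\psi(w_t), w - w_t \rangle.$$

Since $\nabla\psi^* = (\nabla\psi)^{-1}$, the update in step 3 of Algorithm 4.2 can equivalently
be written as $\nabla\psi(\hat{w}_{t+1}) - \nabla\psi(w_t) = -\eta_t g_t$. Plugging this in the above
equation and rearranging

$$\Delta_\psi(w, \hat{w}_{t+1}) = \Delta_\psi(w, w_t) - \eta_t \langle g_t, w_t - w \rangle + \Delta_\psi(w_t, \hat{w}_{t+1}). \tag{4.10}$$

Finally we upper bound $\Delta_\psi(w_t, \hat{w}_{t+1})$. For this we need two observations:
First, $\langle x, y \rangle \leq \frac{1}{2\sigma}\|x\|^2 + \frac{\sigma}{2}\|y\|^2$ for all $x, y \in \mathbb{R}^n$ and $\sigma > 0$. Second, the $\sigma$
strong convexity of $\psi$ allows us to bound $\Delta_\psi(\hat{w}_{t+1}, w_t) \geq \frac{\sigma}{2}\|w_t - \hat{w}_{t+1}\|^2$.
Using these two observations

$$\Delta_\psi(w_t, \hat{w}_{t+1}) = \psi(w_t) - \psi(\hat{w}_{t+1}) - \langle \nabla\psi(\hat{w}_{t+1}), w_t - \hat{w}_{t+1} \rangle$$

$$= -\left(\psi(\hat{w}_{t+1}) - \psi(w_t) - \langle \nabla\psi(w_t), \hat{w}_{t+1} - w_t \rangle\right) + \langle \eta_t g_t, w_t - \hat{w}_{t+1} \rangle$$

$$= -\Delta_\psi(\hat{w}_{t+1}, w_t) + \langle \eta_t g_t, w_t - \hat{w}_{t+1} \rangle$$

$$\leq -\frac{\sigma}{2}\|w_t - \hat{w}_{t+1}\|^2 + \frac{\eta_t^2}{2\sigma}\|g_t\|^2 + \frac{\sigma}{2}\|w_t - \hat{w}_{t+1}\|^2$$

$$= \frac{\eta_t^2}{2\sigma}\|g_t\|^2. \tag{4.11}$$

Inequality (4.7) follows by putting together (4.9), (4.10), and (4.11), while
(4.8) follows by using (4.6) with $f = f_t$ and $w' = w_t$ and substituting into
(4.7). ∎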
Now we are ready to prove regret bounds. 

Lemma 4.4 Let $w^* \in \Omega^*$ denote the best parameter chosen in hindsight,
and let $\|g_t\| \leq L$ for all $t$. Then the regret of Algorithm 4.2 can be bounded
via

$$\sum_{t=1}^{T} f_t(w_t) - f_t(w^*) \leq F\left(\frac{1}{\eta_T} - T\lambda\right) + \frac{L^2}{2\sigma}\sum_{t=1}^{T}\eta_t. \tag{4.12}$$
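Proof Set $w = w^*$ and rearrange (4.8) to obtain

$$f_t(w_t) - f_t(w^*) \leq \frac{1}{\eta_t}\left((1 - \lambda\eta_t)\Delta_\psi(w^*, w_t) - \Delta_\psi(w^*, w_{t+1})\right) + \frac{\eta_t}{2\sigma}\|g_t\|^2.$$

Summing over $t$

$$\sum_{t=1}^{T} f_t(w_t) - f_t(w^*) \leq \underbrace{\sum_{t=1}^{T}\frac{1}{\eta_t}\left((1 - \lambda\eta_t)\Delta_\psi(w^*, w_t) - \Delta_\psi(w^*, w_{t+1})\right)}_{T_1} + \underbrace{\sum_{t=1}^{T}\frac{\eta_t}{2\sigma}\|g_t\|^2}_{T_2}.$$

Since the diameter of $\Omega$ is bounded by $F$ and $\Delta_\psi$ is non-negative

$$T_1 = \left(\frac{1}{\eta_1} - \lambda\right)\Delta_\psi(w^*, w_1) - \frac{1}{\eta_T}\Delta_\psi(w^*, w_{T+1}) + \sum_{t=2}^{T}\Delta_\psi(w^*, w_t)\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \lambda\right)$$

$$\leq \left(\frac{1}{\eta_1} - \lambda\right)\Delta_\psi(w^*, w_1) + \sum_{t=2}^{T}\Delta_\psi(w^*, w_t)\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \lambda\right)$$

$$\leq \left(\frac{1}{\eta_1} - \lambda\right)F + \sum_{t=2}^{T}F\left(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \lambda\right) = F\left(\frac{1}{\eta_T} - T\lambda\right).$$

On the other hand, since the subgradients are bounded in norm by $L$ it
follows that

$$T_2 = \sum_{t=1}^{T}\frac{\eta_t}{2\sigma}\|g_t\|^2 \leq \frac{L^2}{2\sigma}\sum_{t=1}^{T}\eta_t.$$

Putting together the bounds for $T_1$ and $T_2$ yields (4.12). ∎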


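Corollary 4.5 If $\lambda > 0$ and we set $\eta_t = \frac{1}{\lambda t}$ then

$$\sum_{t=1}^{T} f_t(w_t) - f_t(w^*) \leq \frac{L^2}{2\sigma\lambda}(1 + \log(T)).$$

On the other hand, when $\lambda = 0$, if we set $\eta_t = \frac{1}{\sqrt{t}}$ then

$$\sum_{t=1}^{T} f_t(w_t) - f_t(w^*) \leq \left(F + \frac{L^2}{\sigma}\right)\sqrt{T}.$$

Proof First consider $\lambda > 0$ with $\eta_t = \frac{1}{\lambda t}$. In this case $\frac{1}{\eta_T} = T\lambda$, and
consequently (4.12) specializes to

$$\sum_{t=1}^{T} f_t(w_t) - f_t(w^*) \leq \frac{L^2}{2\sigma\lambda}\sum_{t=1}^{T}\frac{1}{t} \leq \frac{L^2}{2\sigma\lambda}(1 + \log(T)).$$

When $\lambda = 0$, we set $\eta_t = \frac{1}{\sqrt{t}}$ and use problem 4.2 to rewrite (4.12) as

$$\sum_{t=1}^{T} f_t(w_t) - f_t(w^*) \leq F\sqrt{T} + \frac{L^2}{\sigma}\sum_{t=1}^{T}\frac{1}{2\sqrt{t}} \leq F\sqrt{T} + \frac{L^2}{\sigma}\sqrt{T}.$$
∎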
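
The logarithmic bound of Corollary 4.5 can be checked numerically on a toy problem. The sketch below is an illustration under assumptions that are not in the text: the losses are $f_t(w) = \frac{1}{2}\|w - z_t\|^2$ with hypothetical targets $z_t$, $\Omega$ is the unit ball, and $\psi = \frac{1}{2}\|\cdot\|^2$, so that $\sigma = \lambda = 1$ and $\|g_t\| \leq 2 = L$.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
T, dim = 500, 2
Z = rng.uniform(-0.5, 0.5, size=(T, dim))   # hypothetical targets inside the unit ball


def loss(w, z):
    """f_t(w) = 0.5 ||w - z_t||^2, strongly convex with modulus 1."""
    return 0.5 * float(np.dot(w - z, w - z))


w = np.zeros(dim)                  # w_1
lam, sigma, L = 1.0, 1.0, 2.0      # modulus, strong convexity of psi, bound on ||g_t||
online_loss = 0.0
for t in range(1, T + 1):
    z = Z[t - 1]
    online_loss += loss(w, z)      # loss is incurred before the update
    g = w - z                      # gradient of f_t at w_t
    w = w - g / (lam * t)          # step size eta_t = 1 / (lam t)
    norm = np.linalg.norm(w)
    if norm > 1.0:                 # Euclidean projection onto the unit ball
        w = w / norm

w_star = Z.mean(axis=0)            # minimiser of the summed losses, lies in the ball
regret = online_loss - sum(loss(w_star, z) for z in Z)
bound = L ** 2 / (2 * sigma * lam) * (1 + math.log(T))
```

On this problem the regret stays below the bound, as the corollary predicts; the bound is usually far from tight for such benign data.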
Problems 
Problem 4.1 (Generalized Cauchy-Schwarz {1}) Show that $\langle x, y \rangle \leq \frac{1}{2\sigma}\|x\|^2 + \frac{\sigma}{2}\|y\|^2$
for all $x, y \in \mathbb{R}^n$ and $\sigma > 0$.

Problem 4.2 (Bounding sum of a series {1}) Show that $\sum_{t=a}^{b} \frac{1}{2\sqrt{t}} \leq \sqrt{b - a + 1}$.
Hint: Upper bound the sum by an integral.


5 


Conditional Densities 


A number of machine learning algorithms can be derived by using conditional
exponential families of distribution (Section 2.3). Assume that the
training set $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ was drawn iid from some underlying
distribution. Using Bayes rule (1.15) one can write the posterior

$$p(\theta|X, Y) \propto p(\theta)p(Y|X, \theta) = p(\theta)\prod_{i=1}^{m} p(y_i|x_i, \theta), \tag{5.1}$$

and hence the negative log-posterior

$$-\log p(\theta|X, Y) = -\sum_{i=1}^{m} \log p(y_i|x_i, \theta) - \log p(\theta) + \text{const.} \tag{5.2}$$

Because we do not have any prior knowledge about the data, we choose a
zero mean unit variance isotropic normal distribution for $p(\theta)$. This yields

$$-\log p(\theta|X, Y) = \frac{1}{2}\|\theta\|^2 - \sum_{i=1}^{m} \log p(y_i|x_i, \theta) + \text{const.} \tag{5.3}$$

Finally, if we assume a conditional exponential family model for $p(y|x, \theta)$,
that is,

$$p(y|x, \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - g(\theta|x)\right), \tag{5.4}$$

then

$$-\log p(\theta|X, Y) = \frac{1}{2}\|\theta\|^2 + \sum_{i=1}^{m} g(\theta|x_i) - \langle \phi(x_i, y_i), \theta \rangle + \text{const.} \tag{5.5}$$

where

$$g(\theta|x) = \log \sum_{y \in Y} \exp\left(\langle \phi(x, y), \theta \rangle\right), \tag{5.6}$$

is the log-partition function. Clearly, (5.5) is a smooth convex objective
function, and algorithms for unconstrained minimization from Chapter 3
can be used to obtain the maximum a posteriori (MAP) estimate for $\theta$. Given
the optimal $\theta$, the class label at any given $x$ can be predicted using

$$y^* = \operatorname{argmax}_{y} p(y|x, \theta). \tag{5.7}$$
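
As a small illustration of (5.4), (5.6), and (5.7), the sketch below evaluates the log-partition function with the usual max-shift for numerical stability and predicts the most probable label. The block-structured joint feature map `joint_features`, the parameter vector, and the input are hypothetical choices made only for this example.

```python
import numpy as np


def log_partition(scores):
    """g(theta|x) = log sum_y exp(<phi(x, y), theta>), computed stably."""
    m = np.max(scores)
    return m + np.log(np.sum(np.exp(scores - m)))


def conditional_probs(phi_xy, theta):
    """phi_xy holds one row phi(x, y) per label y; returns p(y|x, theta) for all y."""
    scores = phi_xy @ theta
    return np.exp(scores - log_partition(scores))


def joint_features(x, num_labels):
    """Hypothetical joint feature map: copy x into the block belonging to label y."""
    d = len(x)
    phi = np.zeros((num_labels, num_labels * d))
    for y in range(num_labels):
        phi[y, y * d:(y + 1) * d] = x
    return phi


theta = np.array([1.0, 0.0, 0.0, 1.0, -1.0, -1.0])
x = np.array([2.0, 0.5])
p = conditional_probs(joint_features(x, 3), theta)
y_star = int(np.argmax(p))         # prediction rule (5.7)
```

The three scores are 2, 0.5, and -2.5, so label 0 receives the highest probability.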


In this chapter we will discuss a number of these algorithms that can be 
derived by specializing the above setup. Our discussion unifies seemingly 
disparate algorithms, which are often discussed separately in literature. 


5.1 Logistic Regression 


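We begin with the simplest case, namely binary classification¹. The key
observation here is that the labels $y \in \{\pm 1\}$ and hence

$$g(\theta|x) = \log\left(\exp(\langle \phi(x, +1), \theta \rangle) + \exp(\langle \phi(x, -1), \theta \rangle)\right). \tag{5.8}$$

Define $\hat\phi(x) := \phi(x, +1) - \phi(x, -1)$. Plugging (5.8) into (5.4), using the
definition of $\hat\phi$ and rearranging

$$p(y = +1|x, \theta) = \frac{1}{1 + \exp(\langle -\hat\phi(x), \theta \rangle)} \quad \text{and} \quad p(y = -1|x, \theta) = \frac{1}{1 + \exp(\langle \hat\phi(x), \theta \rangle)},$$

or more compactly

$$p(y|x, \theta) = \frac{1}{1 + \exp(\langle -y\hat\phi(x), \theta \rangle)}. \tag{5.9}$$

Since $p(y|x, \theta)$ is a logistic function, the method is called logistic
regression. The classification rule (5.7) in this case specializes as follows:
predict $+1$ whenever $p(y = +1|x, \theta) \geq p(y = -1|x, \theta)$, otherwise predict $-1$.
However

$$\log\frac{p(y = +1|x, \theta)}{p(y = -1|x, \theta)} = \langle \hat\phi(x), \theta \rangle,$$

therefore one can equivalently use $\mathrm{sign}(\langle \hat\phi(x), \theta \rangle)$ as our prediction
function. Using (5.9) we can write the objective function of logistic regression
as

$$J(\theta) = \frac{1}{2}\|\theta\|^2 + \sum_{i=1}^{m} \log\left(1 + \exp(\langle -y_i\hat\phi(x_i), \theta \rangle)\right).$$


¹ The name logistic regression is a misnomer!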
To minimize the above objective function we first compute the gradient.

$$\nabla J(\theta) = \theta + \sum_{i=1}^{m} \frac{\exp(\langle -y_i\hat\phi(x_i), \theta \rangle)}{1 + \exp(\langle -y_i\hat\phi(x_i), \theta \rangle)}(-y_i\hat\phi(x_i))$$

$$= \theta + \sum_{i=1}^{m} (p(y_i|x_i, \theta) - 1)\, y_i\hat\phi(x_i).$$

Notice that the second term of the gradient vanishes whenever $p(y_i|x_i, \theta) = 1$.
Therefore, one way to interpret logistic regression is to view it as a method
to maximize $p(y_i|x_i, \theta)$ for each point $(x_i, y_i)$ in the training set. Since the
objective function of logistic regression is twice differentiable one can also
compute its Hessian

$$\nabla^2 J(\theta) = I + \sum_{i=1}^{m} p(y_i|x_i, \theta)(1 - p(y_i|x_i, \theta))\,\hat\phi(x_i)\hat\phi(x_i)^\top,$$

where we used $y_i^2 = 1$. Since the sum is positive semi-definite, the Hessian
is positive definite, as expected for a strongly convex objective. The Hessian
can be used in the Newton method (Section 3.2.6) to obtain the optimal
parameter $\theta$.
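
The gradient and Hessian translate directly into a small Newton solver. The sketch below is a minimal illustration, not the method of Section 3.2.6 verbatim: it adds a simple backtracking line search as a safeguard (an assumption of this example), and the synthetic data and all function names are hypothetical. The rows of `Phi` play the role of $\hat\phi(x_i)$.

```python
import numpy as np


def objective(theta, Phi, y):
    margins = y * (Phi @ theta)
    # log(1 + exp(-margin)) evaluated stably via logaddexp
    return 0.5 * theta @ theta + np.sum(np.logaddexp(0.0, -margins))


def gradient(theta, Phi, y):
    p = 1.0 / (1.0 + np.exp(-y * (Phi @ theta)))    # p(y_i | x_i, theta), eq. (5.9)
    return theta + Phi.T @ ((p - 1.0) * y)


def hessian(theta, Phi, y):
    p = 1.0 / (1.0 + np.exp(-y * (Phi @ theta)))
    return np.eye(len(theta)) + (Phi * (p * (1.0 - p))[:, None]).T @ Phi


def newton(Phi, y, iterations=50):
    theta = np.zeros(Phi.shape[1])
    for _ in range(iterations):
        g = gradient(theta, Phi, y)
        if np.linalg.norm(g) < 1e-10:
            break
        step = np.linalg.solve(hessian(theta, Phi, y), g)
        t = 1.0
        # Backtracking line search (safeguard, not part of the pure Newton step).
        while objective(theta - t * step, Phi, y) > objective(theta, Phi, y) - 0.25 * t * (g @ step):
            t *= 0.5
        theta = theta - t * step
    return theta


rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))                        # hypothetical features
noise = 0.3 * rng.normal(size=50)
y = np.where(Phi @ np.array([1.0, -2.0, 0.5]) + noise >= 0, 1.0, -1.0)
theta_hat = newton(Phi, y)
predictions = np.sign(Phi @ theta_hat)                # sign(<phi_hat(x), theta>)
```

Because the Hessian is at least the identity, each linear system is well conditioned and the iteration typically needs only a handful of steps.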


5.2 Regression 
5.2.1 Conditionally Normal Models 


fixed variance 


5.2.2 Posterior Distribution 


integrating out vs. Laplace approximation, efficient estimation (sparse greedy) 


5.2.3 Heteroscedastic Estimation 


explain that we have two parameters. not too many details (do that as an 
assignment). 


5.3 Multiclass Classification 
5.3.1 Conditionally Multinomial Models 


joint feature map 
5.4 What is a CRF?


• Motivation with learning a digit example
• general definition
• Gaussian process + structure = CRF


5.4.1 Linear Chain CRFs 


• Graphical model
• Applications
• Optimization problem


5.4.2 Higher Order CRFs 


• 2-d CRFs and their applications in vision
• Skip chain CRFs
• Hierarchical CRFs (graph transducers, Sutton et al., JMLR, etc.)


5.4.3 Kernelized CRFs


• From feature maps to kernels
• The clique decomposition theorem
• The representer theorem
• Optimization strategies for kernelized CRFs


5.5 Optimization Strategies 
5.5.1 Getting Started 
• three things needed to optimize

  - MAP estimate
  - log-partition function
  - gradient of log-partition function

• Worked out example (linear chain?)


5.5.2 Optimization Algorithms 
- Optimization algorithms (LBFGS, SGD, EG (Globerson et. al)) 


5.5.3 Handling Higher order CRFs 
- How things can be done for higher order CRF (briefly) 
5.6 Hidden Markov Models


Definition 

Discuss that they are modeling joint distribution p(x, y) 

The way they predict is by marginalizing out x 

Why they are wasteful and why CRFs generally outperform them 


5.7 Further Reading 
What we did not talk about: 


• Details of HMM optimization
• CRF applied to predicting parse trees via matrix tree theorem (Collins,
  Koo et al.)
• CRFs for graph matching problems
• CRF with Gaussian distributions (yes they exist)


5.7.1 Optimization 


issues in optimization (blows up with number of classes). structure is not 
there. can we do better? 


Problems 
Problem 5.1 Poisson models 
Problem 5.2 Bayes Committee Machine 


Problem 5.3 Newton / CG approach 
Kernels and Function Spaces


Kernels are measures of similarity. Broadly speaking, machine learning
algorithms which rely only on the dot product between instances can be
"kernelized" by replacing all instances of $\langle x, x' \rangle$ by a kernel function $k(x, x')$.
We saw examples of such algorithms in Sections 1.3.3 and 1.3.4 and we will 
see many more examples in Chapter 7. Arguably, the design of a good ker- 
nel underlies the success of machine learning in many applications. In this 
chapter we will lay the ground for the theoretical properties of kernels and 
present a number of examples. Algorithms which use these kernels can be 
found in later chapters. 


6.1 The Basics 


Let $X$ denote the space of inputs and $k : X \times X \to \mathbb{R}$ be a function which
satisfies

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle \tag{6.1}$$


where ® is a feature map which maps X into some dot product space H. In 
other words, kernels correspond to dot products in some dot product space. 
The advantages of using a kernel as a similarity measure are threefold:
First, if the feature space is rich enough, then simple estimators such as 
hyperplanes and half-spaces may be sufficient. For instance, to classify the 
points in Figure BUGBUG, we need a nonlinear decision boundary, but 
once we map the points to a 3 dimensional space a hyperplane suffices. 
Second, kernels allow us to construct machine learning algorithms in the 
dot product space H without explicitly computing ®(x). Third, we need not 
make any assumptions about the input space X other than for it to be a 
set. As we will see later in this chapter, this allows us to compute similarity 
between discrete objects such as strings, trees, and graphs. In the first half 
of this chapter we will present some examples of kernels, and discuss some 
theoretical properties of kernels in the second half. 
6.1.1 Examples
6.1.1.1 Linear Kernel


Linear kernels are perhaps the simplest of all kernels. We assume that $x \in \mathbb{R}^n$
and define

$$k(x, x') = \langle x, x' \rangle = \sum_{i} x_i x'_i.$$

If $x$ and $x'$ are dense then computing the kernel takes $O(n)$ time. On the
other hand, for sparse vectors this can be reduced to $O(|\mathrm{nnz}(x) \cap \mathrm{nnz}(x')|)$,
where $\mathrm{nnz}(\cdot)$ denotes the set of non-zero indices of a vector and $|\cdot|$
denotes the size of a set. Linear kernels are a natural representation to use for
vectorial data. They are also widely used in text mining where documents 
are represented by a vector containing the frequency of occurrence of words 
(Recall that we encountered this so-called bag of words representation in 
Chapter 1). Instead of a simple bag of words, one can also map a text to the 
set of pairs of words that co-occur in a sentence for a richer representation. 
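
For sparse data such as bag of words vectors, the linear kernel only needs to touch indices that are non-zero in both arguments. The sketch below stores each vector as a dictionary from index (here a word) to value; this representation, the function name, and the toy documents are assumptions of the example. With hash lookups the loop runs over the vector with fewer non-zeros.

```python
def sparse_dot(x: dict, x_prime: dict) -> float:
    """Linear kernel for sparse vectors stored as {index: value} dictionaries."""
    if len(x) > len(x_prime):         # iterate over the vector with fewer non-zeros
        x, x_prime = x_prime, x
    return sum(v * x_prime[i] for i, v in x.items() if i in x_prime)


# Hypothetical bag-of-words vectors: word -> number of occurrences.
doc_a = {"kernel": 2, "machine": 1, "learning": 1}
doc_b = {"kernel": 1, "graph": 3, "learning": 2}
k_ab = sparse_dot(doc_a, doc_b)       # shared words: "kernel" and "learning"
```

Only the two shared words contribute, giving 2·1 + 1·2 = 4.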


6.1.1.2 Polynomial Kernel 


Given $x \in \mathbb{R}^n$, we can compute a feature map $\Phi$ by taking all the $d$-th
order products (also called the monomials) of the entries of $x$. To illustrate
with a concrete example, let us consider $x = (x_1, x_2)$ and $d = 2$, in which
case $\Phi(x) = (x_1^2, x_2^2, x_1 x_2, x_2 x_1)$. Although it is tedious to compute $\Phi(x)$
and $\Phi(x')$ explicitly in order to compute $k(x, x')$, there is a shortcut as the
following proposition shows.


Proposition 6.1 Let $\Phi(x)$ (resp. $\Phi(x')$) denote the vector whose entries
are all possible $d$-th degree ordered products of the entries of $x$ (resp. $x'$).
Then

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle = (\langle x, x' \rangle)^d. \tag{6.2}$$


Proof By direct computation

$$\langle \Phi(x), \Phi(x') \rangle = \sum_{j_1} \cdots \sum_{j_d} x_{j_1} \cdots x_{j_d} \cdot x'_{j_1} \cdots x'_{j_d}$$

$$= \sum_{j_1} x_{j_1} x'_{j_1} \cdots \sum_{j_d} x_{j_d} x'_{j_d} = \Big(\sum_{j} x_j x'_j\Big)^d$$

$$= (\langle x, x' \rangle)^d.$$
∎

The kernel (6.2) is called the polynomial kernel. A useful extension is the
inhomogeneous polynomial kernel

$$k(x, x') = (\langle x, x' \rangle + c)^d, \tag{6.3}$$

which computes all monomials up to degree $d$ (problem 6.2).
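
Proposition 6.1 is easy to confirm numerically on a small example. The sketch below builds the explicit feature map of all ordered $d$-th degree products, which has $n^d$ entries, and compares its dot product with the shortcut; the vectors and the helper name `ordered_monomials` are illustrative assumptions.

```python
import itertools
import numpy as np


def ordered_monomials(x, d):
    """Phi(x): all d-th degree ordered products of the entries of x (n**d entries)."""
    return np.array([np.prod([x[j] for j in idx])
                     for idx in itertools.product(range(len(x)), repeat=d)])


x = np.array([1.0, 2.0, -1.0])
x_prime = np.array([0.5, -1.0, 3.0])
d = 3
explicit = ordered_monomials(x, d) @ ordered_monomials(x_prime, d)   # O(n**d) work
shortcut = (x @ x_prime) ** d                                        # O(n) work
```

Both quantities agree up to rounding error, while the explicit route already needs 27 features for three inputs and degree three.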


6.1.1.3 Radial Basis Function Kernels 
6.1.1.4 Convolution Kernels 


The framework of convolution kernels is a general way to extend the notion 
of kernels to structured objects such as strings, trees, and graphs. Let $x \in X$
be a discrete object which can be decomposed into $P$ parts $x_p \in X_p$ in many
different ways. As a concrete example consider the string $x = abc$ which can
be split into two sets of substrings of size two, namely $\{a, bc\}$ and $\{ab, c\}$.
We denote the set of all such decompositions as $R(x)$, and use it to build a
kernel on $X$ as follows:

$$[k_1 \star \ldots \star k_P](x, x') = \sum_{\bar{x} \in R(x),\, \bar{x}' \in R(x')} \prod_{p=1}^{P} k_p(\bar{x}_p, \bar{x}'_p). \tag{6.4}$$

Here, the sum is over all possible ways in which we can decompose $x$ and
$x'$ into $\bar{x}_1, \ldots, \bar{x}_P$ and $\bar{x}'_1, \ldots, \bar{x}'_P$ respectively. If the cardinality of $R(x)$ is
finite, then it can be shown that (6.4) results in a valid kernel. Although 
convolution kernels provide the abstract framework, specific instantiations 
of this idea lead to a rich set of kernels on discrete objects. We will now 
discuss some of them in detail. 
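
A direct, brute-force reading of (6.4) for the string example above might look as follows. This is only a sketch under assumptions: $P = 2$, the decompositions are all prefix/suffix splits, and each part kernel is a hypothetical exact-match (delta) kernel.

```python
def splits(x: str):
    """R(x) for P = 2: all ways to cut x into a non-empty prefix and suffix."""
    return [(x[:i], x[i:]) for i in range(1, len(x))]


def delta(a: str, b: str) -> float:
    """Base kernel on parts: 1 if the two parts are equal, 0 otherwise."""
    return 1.0 if a == b else 0.0


def convolution_kernel(x: str, x_prime: str, part_kernels=(delta, delta)) -> float:
    """Sum over all pairs of decompositions of the product of part kernels, eq. (6.4)."""
    total = 0.0
    for parts in splits(x):
        for parts_prime in splits(x_prime):
            prod = 1.0
            for k_p, a, b in zip(part_kernels, parts, parts_prime):
                prod *= k_p(a, b)
            total += prod
    return total
```

For instance, `convolution_kernel("abc", "abc")` counts the two decompositions {a, bc} and {ab, c} matching themselves, while `convolution_kernel("abc", "abd")` is zero because no pair of decompositions agrees in both parts.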


6.1.1.5 String Kernels 


The basic idea behind string kernels is simple: Compare the strings by
means of the subsequences they contain. The more subsequences two strings
have in common, the more similar they are. The subsequences need not have
equal weights. For instance, the weight of a subsequence may be given by the
inverse frequency of its occurrence. Similarly, if the first and last characters
of a subsequence are rather far apart, then its contribution to the kernel
must be down-weighted.

Formally, a string $x$ is composed of characters from a finite alphabet $\Sigma$
and $|x|$ denotes its length. We say that $s$ is a subsequence of $x = x_1 x_2 \ldots x_{|x|}$
if $s = x_{i_1} x_{i_2} \ldots x_{i_{|s|}}$ for some $1 \leq i_1 < i_2 < \ldots < i_{|s|} \leq |x|$. In particular, if
$i_{j+1} = i_j + 1$ for all $j$ then $s$ is a substring of $x$. For example, acb is not a
subsequence of adbc while abc is a subsequence and adb is a substring. Assume
that there exists a function $\#(x, s)$ which returns the number of times a
subsequence $s$ occurs in $x$ and a non-negative weighting function $w(s) \geq 0$ which
returns the weight associated with $s$. Then the basic string kernel can be written as

$$k(x, x') = \sum_{s} \#(x, s)\, \#(x', s)\, w(s). \tag{6.5}$$


Different string kernels are derived by specializing the above equation: 

All substrings kernel: If we restrict the summation in (6.5) to sub- 
strings then [VS04] provide a suffix tree based algorithm which allows one 
to compute for arbitrary w(s) the kernel k(x, x’) in O(|x| + |x’|) time and 
memory. 

k-Spectrum kernel: The k-spectrum kernel is obtained by restricting
the summation in (6.5) to substrings of length $k$. A slightly more general
variant considers all substrings of length up to $k$. Here $k$ is a tuning parameter
which is typically set to be a small number (e.g., 5). A simple trie based
algorithm can be used to compute the k-spectrum kernel in $O((|x| + |x'|)k)$
time (problem 6.3).

Inexact substring kernel: Sometimes the input strings might have
measurement errors and therefore it is desirable to take into account inexact
matches. This is done by replacing $\#(x, s)$ in (6.5) by another function
$\#(x, s, \epsilon)$ which reports the number of approximate matches of $s$ in $x$. Here
$\epsilon$ denotes the number of mismatches allowed, typically a small number (e.g.,
3). By trading off computational complexity with storage the kernel can be
computed efficiently. See [LIK03] for details.

Mismatch kernel: If, instead of simply counting the number of occurrences
of a substring, we use a weighting scheme which down-weights the
contributions of longer subsequences, then this yields the so-called mismatch kernel.
Given an index sequence $I = (i_1, \ldots, i_k)$ with $1 \leq i_1 < i_2 < \ldots < i_k \leq |x|$
we can associate the subsequence $x(I) = x_{i_1} x_{i_2} \ldots x_{i_k}$ with $I$. Furthermore,
define $|I| = i_k - i_1 + 1$. Clearly, $|I| > k$ if $I$ is not contiguous. Let $\lambda \leq 1$ be
a decay factor. Redefine

$$\#(x, s) = \sum_{s = x(I)} \lambda^{|I|}, \tag{6.6}$$

that is, we count all occurrences of $s$ in $x$ but now the weight associated with
a subsequence depends on its length. To illustrate, consider the subsequence
abc which occurs in the string abcebc with the index sequences (1, 2, 3),
(1, 2, 6), and (1, 5, 6). The first occurrence is counted with weight $\lambda^3$ while
the other two, which both span the whole string, are counted with weight
$\lambda^6$. As it turns out, this kernel can be computed
by a dynamic programming algorithm (problem BUGBUG) in $O(|x| \cdot |x'|)$
time.
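
Two of the specializations above are short enough to sketch directly. The first function computes the k-spectrum kernel with hash-based counting rather than the trie mentioned in the text (a swapped-in technique, chosen for brevity) and assumes $w(s) = 1$. The second evaluates the decayed count (6.6) by brute force over all index sequences, which is exponential in general and only meant to make the definition concrete; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def spectrum_kernel(x: str, x_prime: str, k: int) -> int:
    """k-spectrum kernel: (6.5) restricted to substrings of length k, with w(s) = 1."""
    counts = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    counts_prime = Counter(x_prime[i:i + k] for i in range(len(x_prime) - k + 1))
    return sum(c * counts_prime[s] for s, c in counts.items())


def decayed_count(x: str, s: str, lam: float) -> float:
    """#(x, s) from (6.6): sum of lam**|I| over index sequences I with x(I) = s."""
    total = 0.0
    for idx in combinations(range(len(x)), len(s)):
        if all(x[i] == c for i, c in zip(idx, s)):
            total += lam ** (idx[-1] - idx[0] + 1)    # |I| = i_k - i_1 + 1
    return total
```

For example, `spectrum_kernel("abcabc", "bcab", 2)` sums the products of the counts of ab, bc, and ca, and `decayed_count("abcebc", "abc", lam)` adds one contiguous occurrence of span 3 and two gapped occurrences of span 6.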
6.1.1.6 Graph Kernels


There are two different notions of graph kernels. First, kernels on graphs
are used to compare nodes of a single graph. In contrast, kernels between
graphs focus on comparing two graphs. A random walk (or its continuous
time limit, diffusion) underlies both types of kernels. The basic intuition is
that two nodes are similar if there are a number of paths which connect
them while two graphs are similar if they share many common paths. To
describe these kernels formally we need to introduce some notation.

A graph $G$ consists of an ordered set of $n$ vertices $V = \{v_1, v_2, \ldots, v_n\}$,
and a set of directed edges $E \subset V \times V$. A vertex $v_i$ is said to be a neighbor
of another vertex $v_j$ if they are connected by an edge, i.e., if $(v_i, v_j) \in E$;
this is also denoted $v_i \sim v_j$. The adjacency matrix of a graph is the $n \times n$
matrix $A$ with $A_{ij} = 1$ if $v_i \sim v_j$, and 0 otherwise. A walk of length $k$ on $G$
is a sequence of indices $i_0, i_1, \ldots, i_k$ such that $v_{i_{r-1}} \sim v_{i_r}$ for all $1 \leq r \leq k$.

The adjacency matrix has a normalized cousin, defined $\tilde{A} := D^{-1}A$, which
has the property that each of its rows sums to one, and it can therefore
serve as the transition matrix for a stochastic process. Here, $D$ is a
diagonal matrix of node degrees, i.e., $D_{ii} = d_i = \sum_j A_{ij}$. A random walk on
$G$ is a process generating sequences of vertices $v_{i_1}, v_{i_2}, v_{i_3}, \ldots$ according to
$P(i_{k+1}|i_1, \ldots, i_k) = \tilde{A}_{i_k, i_{k+1}}$. The $t$-th power of $\tilde{A}$ thus describes $t$-length walks,
i.e., $(\tilde{A}^t)_{ij}$ is the probability of a transition from vertex $v_i$ to vertex $v_j$ via
a walk of length $t$ (problem BUGBUG). If $p_0$ is an initial probability
distribution over vertices, then the probability distribution $p_t$ describing the
location of our random walker at time $t$ is $p_t = \tilde{A}^t p_0$. The $j$-th component of
$p_t$ denotes the probability of finishing a $t$-length walk at vertex $v_j$. A random
walk need not continue indefinitely; to model this, we associate every node
$v_{i_k}$ in the graph with a stopping probability $q_{i_k}$. The overall probability of
stopping after $t$ steps is given by $q^\top p_t$.

Given two graphs $G(V, E)$ and $G'(V', E')$, their direct product $G_\times$ is a
graph with vertex set

$$V_\times = \{(v_i, v'_r) : v_i \in V, v'_r \in V'\}, \tag{6.7}$$

and edge set

$$E_\times = \{((v_i, v'_r), (v_j, v'_s)) : (v_i, v_j) \in E \wedge (v'_r, v'_s) \in E'\}. \tag{6.8}$$

In other words, $G_\times$ is a graph over pairs of vertices from $G$ and $G'$, and
two vertices in $G_\times$ are neighbors if and only if the corresponding vertices
in $G$ and $G'$ are both neighbors; see Figure 6.1 for an illustration. If $A$ and
$A'$ are the respective adjacency matrices of $G$ and $G'$, then the adjacency
matrix of $G_\times$ is $A_\times = A \otimes A'$. Similarly, $\tilde{A}_\times = \tilde{A} \otimes \tilde{A}'$. Performing a random
walk on the direct product graph is equivalent to performing a simultaneous
random walk on $G$ and $G'$. If $p$ and $p'$ denote initial probability distributions
over the vertices of $G$ and $G'$, then the corresponding initial probability
distribution on the direct product graph is $p_\times := p \otimes p'$. Likewise, if $q$ and
$q'$ are stopping probabilities (that is, the probability that a random walk
ends at a given vertex), then the stopping probability on the direct product
graph is $q_\times := q \otimes q'$.


Fig. 6.1. Two graphs ($G_1$ & $G_2$) and their direct product ($G_\times$). Each node of the
direct product graph is labeled with a pair of nodes (6.7); an edge exists in the
direct product if and only if the corresponding nodes are adjacent in both original
graphs (6.8). For instance, nodes 11' and 32' are adjacent because there is an edge
between nodes 1 and 3 in the first, and 1' and 2' in the second graph.

To define a kernel which computes the similarity between $G$ and $G'$, one
natural idea is to simply sum up $q_\times^\top \tilde{A}_\times^t p_\times$ for all values of $t$. However, this
sum might not converge, leaving the kernel value undefined. To overcome
this problem, we introduce appropriately chosen non-negative coefficients
$\mu(t)$, and define the kernel between $G$ and $G'$ as

$$k(G, G') = \sum_{t=0}^{\infty} \mu(t)\, q_\times^\top \tilde{A}_\times^t\, p_\times. \tag{6.9}$$

This idea can be extended to graphs whose nodes are associated with labels
by replacing the matrix $\tilde{A}_\times$ with a matrix of label similarities. For
appropriate choices of $\mu(t)$ the above sum converges and efficient algorithms for
computing the kernel can be devised. See [?] for details.
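
A truncated version of the sum (6.9) can be sketched with Kronecker products. Several choices here are assumptions made for illustration: the coefficients are geometric, $\mu(t) = \text{decay}^t$, the initial and stopping distributions are uniform, every vertex has at least one outgoing edge, and the sum is cut off after `max_steps` terms. One point worth flagging: since $\tilde{A} = D^{-1}A$ is row-stochastic, the sketch propagates the distribution over vertices with the transpose of $\tilde{A}_\times$, which is what "probability of finishing a $t$-length walk at vertex $v_j$" requires under that normalization.

```python
import numpy as np


def row_normalize(A):
    """A_tilde = D^{-1} A; assumes every vertex has at least one outgoing edge."""
    return A / A.sum(axis=1, keepdims=True)


def random_walk_kernel(A, A_prime, decay=0.5, max_steps=50):
    """Truncated random walk kernel with mu(t) = decay**t and uniform p, q."""
    n, n_prime = A.shape[0], A_prime.shape[0]
    P_x = np.kron(row_normalize(A), row_normalize(A_prime))     # transition matrix on G_x
    p_x = np.kron(np.full(n, 1.0 / n), np.full(n_prime, 1.0 / n_prime))
    q_x = np.kron(np.full(n, 1.0 / n), np.full(n_prime, 1.0 / n_prime))
    value, p_t = 0.0, p_x
    for t in range(max_steps + 1):
        value += decay ** t * float(q_x @ p_t)
        p_t = P_x.T @ p_t          # row-stochastic matrix: distributions move by the transpose
    return value


# Hypothetical graphs: a triangle and a path on three vertices.
triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
k_tp = random_walk_kernel(triangle, path)
k_pt = random_walk_kernel(path, triangle)
```

Swapping the two graphs only permutes the vertices of the product graph, so the two kernel values coincide up to rounding.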

As it turns out, the simple idea of performing a random walk on the product
graph can be extended to compute kernels on Auto Regressive Moving
Average (ARMA) models [VSV07]. Similarly, it can also be used to define
kernels between transducers. Connections between the so-called rational
kernels on transducers and the graph kernels defined via (6.9) are made explicit
in [?].


6.2 Kernels 
6.2.1 Feature Maps 


give examples, linear classifier, nonlinear ones with r2-r3 map 


6.2.2 The Kernel Trick 
6.2.3 Examples of Kernels 


gaussian, polynomial, linear, texts, graphs 
- stress the fact that there is a difference between structure in the input 
space and structure in the output space 


6.3 Algorithms 

6.3.1 Kernel Perceptron 

6.3.2 Trivial Classifier 

6.3.3 Kernel Principal Component Analysis 
6.4 Reproducing Kernel Hilbert Spaces 


As it turns out, this class of functions coincides with the class of positive 
semi-definite functions. Intuitively, the notion of a positive semi-definite 
function is an extension of the familiar notion of a positive semi-definite 
matrix (also see Appendix BUGBUG): 


Definition 6.2 A real $n \times n$ symmetric matrix $K$ satisfying

$$\sum_{i,j} \alpha_i \alpha_j K_{ij} \geq 0 \tag{6.10}$$

for all $\alpha_i, \alpha_j \in \mathbb{R}$ is called positive semi-definite. If equality in (6.10) occurs
only when $\alpha_1, \ldots, \alpha_n = 0$, then $K$ is said to be positive definite.


Definition 6.3 Given a set of points $x_1, \ldots, x_n \in X$ and a function $k$, the
matrix

$$K_{ij} = k(x_i, x_j) \tag{6.11}$$

is called the Gram matrix or the kernel matrix of $k$ with respect to $x_1, \ldots, x_n$.


Definition 6.4 Let $X$ be a nonempty set, $k : X \times X \to \mathbb{R}$ be a function. If
$k$ gives rise to a positive (semi-)definite Gram matrix for all $x_1, \ldots, x_n \in X$
and $n \in \mathbb{N}$ then $k$ is said to be positive (semi-)definite.


Clearly, every kernel function $k$ of the form (6.1) is positive semi-definite.
To see this simply write

$$\sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) = \sum_{i,j} \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle = \Big\langle \sum_{i} \alpha_i \Phi(x_i), \sum_{j} \alpha_j \Phi(x_j) \Big\rangle \geq 0.$$
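
Definitions 6.2 to 6.4 suggest a simple numerical sanity check: build the Gram matrix on a sample and inspect its smallest eigenvalue. The sketch below does this for the inhomogeneous polynomial kernel (6.3) and, for contrast, for a symmetric function that is not of the form (6.1). The sample points, the tolerance, and the helper names are assumptions of the example; passing such a test on one sample does not prove positive semi-definiteness, whereas failing it disproves it.

```python
import numpy as np


def gram_matrix(kernel, points):
    """K_ij = k(x_i, x_j) as in (6.11)."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)] for i in range(n)])


def is_positive_semidefinite(K, tol=1e-9):
    """Check (6.10) through the smallest eigenvalue of the symmetric matrix K."""
    return bool(np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol)


def poly(x, z):
    """Inhomogeneous polynomial kernel (6.3) with c = 1 and d = 2."""
    return (np.dot(x, z) + 1.0) ** 2


def neg_dist(x, z):
    """Symmetric similarity that is not a dot product in any feature space."""
    return -np.linalg.norm(x - z)


rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
K_poly = gram_matrix(poly, points)
K_bad = gram_matrix(neg_dist, points)
```

The matrix `K_bad` has a zero diagonal and non-zero off-diagonal entries, so its eigenvalues sum to zero and at least one of them must be negative.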


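We now establish the converse, that is, we show that every positive
semi-definite kernel function can be written as (6.1). Towards this end, define a
map $\Phi$ from $X$ into the space of functions mapping $X$ to $\mathbb{R}$ (denoted $\mathbb{R}^X$) via
$\Phi(x) = k(\cdot, x)$. In other words, $\Phi(x) : X \to \mathbb{R}$ is a function which assigns the
value $k(x', x)$ to $x' \in X$. Next construct a vector space by taking all possible
linear combinations of $\Phi(x)$

$$f(\cdot) = \sum_{i=1}^{n} \alpha_i \Phi(x_i) = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i), \tag{6.12}$$

where $n \in \mathbb{N}$, $\alpha_i \in \mathbb{R}$, and $x_i \in X$ are arbitrary. Given a second function
$g(\cdot) = \sum_{j=1}^{n'} \beta_j k(\cdot, x'_j)$, this space can be endowed with a natural dot product

$$\langle f, g \rangle = \sum_{i=1}^{n} \sum_{j=1}^{n'} \alpha_i \beta_j k(x_i, x'_j). \tag{6.13}$$

To see that the above dot product is well defined even though it contains
the expansion coefficients (which need not be unique), note that $\langle f, g \rangle =
\sum_{j=1}^{n'} \beta_j f(x'_j)$, independent of $\alpha_i$. Similarly, for $g$, note that $\langle f, g \rangle = \sum_{i=1}^{n} \alpha_i g(x_i)$,
this time independent of $\beta_j$. This also shows that $\langle f, g \rangle$ is bilinear.
Symmetry follows because $\langle f, g \rangle = \langle g, f \rangle$, while the positive semi-definiteness of $k$
implies that

$$\langle f, f \rangle = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j) \geq 0. \tag{6.14}$$

Applying (6.13) shows that for all functions (6.12) we have

$$\langle f, k(\cdot, x) \rangle = f(x). \tag{6.15}$$

In particular

$$\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x'). \tag{6.16}$$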
In view of these properties, $k$ is called a reproducing kernel. By using (6.15)
and the following property of positive semi-definite functions (problem 6.1)

$$k(x, x')^2 \leq k(x, x) \cdot k(x', x') \tag{6.17}$$

we can now write

$$|f(x)|^2 = |\langle f, k(\cdot, x) \rangle|^2 \leq k(x, x) \cdot \langle f, f \rangle. \tag{6.18}$$

From the above inequality, $f = 0$ whenever $\langle f, f \rangle = 0$, thus establishing
$\langle \cdot, \cdot \rangle$ as a valid dot product. In fact, one can complete the space of functions
(6.12) in the norm corresponding to the dot product (6.13), and thus get a
Hilbert space $H$, called the reproducing kernel Hilbert Space (RKHS).
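
The construction (6.12) to (6.18) can be mirrored in code by representing a function through its expansion coefficients and centers. The sketch below is illustrative only: it assumes a Gaussian RBF kernel $k(x, z) = \exp(-\gamma\|x - z\|^2)$, which is a standard positive semi-definite kernel but is not defined in the text above, and the class name `KernelExpansion` is hypothetical.

```python
import numpy as np


def rbf(x, z, gamma=0.5):
    """Gaussian RBF kernel (an assumed example of a positive semi-definite kernel)."""
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2)))


class KernelExpansion:
    """f(.) = sum_i alpha_i k(., x_i), an element of the span (6.12)."""

    def __init__(self, alphas, centers, kernel=rbf):
        self.alphas = np.asarray(alphas, dtype=float)
        self.centers = np.asarray(centers, dtype=float)
        self.kernel = kernel

    def __call__(self, x):
        return sum(a * self.kernel(x, c) for a, c in zip(self.alphas, self.centers))

    def dot(self, other):
        """Dot product (6.13) between two expansions."""
        return sum(a * b * self.kernel(c, d)
                   for a, c in zip(self.alphas, self.centers)
                   for b, d in zip(other.alphas, other.centers))


rng = np.random.default_rng(0)
f = KernelExpansion(rng.normal(size=4), rng.normal(size=(4, 2)))
x = np.array([0.3, -0.7])
k_x = KernelExpansion([1.0], [x])     # the function k(., x)
reproduced = f.dot(k_x)               # <f, k(., x)>, equal to f(x) by (6.15)
```

Evaluating `f(x)` and `f.dot(k_x)` gives the same number, and `f(x) ** 2` never exceeds `rbf(x, x) * f.dot(f)`, in line with (6.18).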

An alternate way to define a RKHS is as a Hilbert space $H$ of functions
from some input space $X$ to $\mathbb{R}$ with the property that for any $f \in H$ and
$x \in X$, the point evaluations $f \mapsto f(x)$ are continuous (in particular, all
point values $f(x)$ are well defined, which already distinguishes an RKHS
from many $L_2$ Hilbert spaces). Given the point evaluation functional, one
can then construct the reproducing kernel using the Riesz representation
theorem. The Moore-Aronszajn theorem states that, for every positive
semi-definite kernel on $X \times X$, there exists a unique RKHS and vice versa.

We finish this section by noting that $\langle \cdot, \cdot \rangle$ is a positive semi-definite
function in the vector space of functions (6.12). This follows directly from the
bilinearity of the dot product and (6.14) by which we can write for functions
$f_1, \ldots, f_p$ and coefficients $\gamma_1, \ldots, \gamma_p$

$$\sum_{i,j} \gamma_i \gamma_j \langle f_i, f_j \rangle = \Big\langle \sum_{i} \gamma_i f_i, \sum_{j} \gamma_j f_j \Big\rangle \geq 0. \tag{6.19}$$


6.4.1 Hilbert Spaces 


evaluation functionals, inner products 


6.4.2 Theoretical Properties 


Mercer’s theorem, positive semidefiniteness 


6.4.3 Regularization 


Representer theorem, regularization 
6.5 Banach Spaces
6.5.1 Properties 
6.5.2 Norms and Convex Sets 


- smoothest function (L2) - smallest coefficients (L1) - structured priors 
(CAP formalism) 


Problems 


Problem 6.1 Show that (6.17) holds for an arbitrary positive semi-definite 
function k. 


Problem 6.2 Show that the inhomogeneous polynomial kernel (6.3) is a 
valid kernel and that it computes all monomials of degree up to d. 


Problem 6.3 (k-spectrum kernel {2}) Given two strings $x$ and $x'$ show
how one can compute the k-spectrum kernel (section 6.1.1.5) in $O((|x| +
|x'|)k)$ time. Hint: You need to use a trie.
Linear Models


A hyperplane in a space $H$ endowed with a dot product $\langle \cdot, \cdot \rangle$ is described by
the set

$$\{x \in H \mid \langle w, x \rangle + b = 0\} \tag{7.1}$$

where $w \in H$ and $b \in \mathbb{R}$. Such a hyperplane naturally divides $H$ into two
half-spaces: $\{x \in H \mid \langle w, x \rangle + b \geq 0\}$ and $\{x \in H \mid \langle w, x \rangle + b < 0\}$, and
hence can be used as the decision boundary of a binary classifier. In this
chapter we will study a number of algorithms which employ such linear
decision boundaries. Although such models look restrictive at first glance,
when combined with kernels (Chapter 6) they yield a large class of useful
algorithms.

All the algorithms we will study in this chapter maximize the margin.
Given a set $X = \{x_1, \ldots, x_m\}$, the margin is the distance of the closest point
in $X$ to the hyperplane (7.1). Elementary geometric arguments (Problem 7.1)
show that the distance of a point $x_i$ to a hyperplane is given by $|\langle w, x_i \rangle +
b| / \|w\|$, and hence the margin is simply

$$\min_{i=1,\ldots,m} \frac{|\langle w, x_i \rangle + b|}{\|w\|}. \tag{7.2}$$

Note that the parameterization of the hyperplane (7.1) is not unique; if we
multiply both $w$ and $b$ by the same non-zero constant, then we obtain the
same hyperplane. One way to resolve this ambiguity is to set

$$\min_{i=1,\ldots,m} |\langle w, x_i \rangle + b| = 1.$$

In this case, the margin simply becomes $1/\|w\|$. We postpone justification
of margin maximization for later and jump straight ahead to the description
of various algorithms.
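
The margin (7.2) and the canonical rescaling are straightforward to compute. The sketch below uses a hypothetical two-dimensional point set and hyperplane, and assumes that no point lies exactly on the hyperplane so that the rescaling is well defined.

```python
import numpy as np


def margin(w, b, X):
    """Distance of the closest row of X to the hyperplane <w, x> + b = 0, eq. (7.2)."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)


def canonical_form(w, b, X):
    """Rescale (w, b) so that min_i |<w, x_i> + b| = 1 (requires a non-zero minimum)."""
    scale = np.min(np.abs(X @ w + b))
    return w / scale, b / scale


# Hypothetical data and hyperplane.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
w, b = np.array([2.0, 2.0]), -1.0
w_c, b_c = canonical_form(w, b, X)
```

Rescaling does not move the hyperplane, so `margin(w, b, X)` and `margin(w_c, b_c, X)` agree, and in canonical form the margin equals `1 / np.linalg.norm(w_c)`.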


7.1 Support Vector Classification 


Consider a binary classification task, where we are given a training set
$\{(x_1, y_1), \ldots, (x_m, y_m)\}$ with $x_i \in \mathcal{H}$ and $y_i \in \{\pm 1\}$. Our aim is to find
a linear decision boundary parameterized by $(w, b)$ such that $\langle w, x_i\rangle + b \ge 0$
whenever $y_i = +1$ and $\langle w, x_i\rangle + b < 0$ whenever $y_i = -1$.

Fig. 7.1. A linearly separable toy binary classification problem of separating the
diamonds from the circles. We normalize $(w, b)$ to ensure that $\min_{i=1,\ldots,m} |\langle w, x_i\rangle + b| = 1$.
In this case, the margin is given by $1/\|w\|$, as the calculation in the inset shows:
from $\langle w, x_1\rangle + b = +1$ and $\langle w, x_2\rangle + b = -1$ it follows that $\langle w, x_1 - x_2\rangle = 2$,
and hence $\langle w/\|w\|, x_1 - x_2\rangle = 2/\|w\|$.

Furthermore, as discussed above, we fix the scaling of $w$ by requiring
$\min_{i=1,\ldots,m} |\langle w, x_i\rangle + b| = 1$.
A compact way to write our desiderata is to require $y_i(\langle w, x_i\rangle + b) \ge 1$ for
all $i$ (also see Figure 7.1). The problem of maximizing the margin therefore
reduces to


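$$\max_{w,b} \frac{1}{\|w\|} \tag{7.3a}$$
$$\text{s.t. } y_i(\langle w, x_i\rangle + b) \ge 1 \text{ for all } i, \tag{7.3b}$$

or equivalently

$$\min_{w,b} \frac{1}{2}\|w\|^2 \tag{7.4a}$$
$$\text{s.t. } y_i(\langle w, x_i\rangle + b) \ge 1 \text{ for all } i. \tag{7.4b}$$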


This is a constrained convex optimization problem with a quadratic objec- 
tive function and linear constraints (see Section 3.3). In deriving (7.4) we 
implicitly assumed that the data is linearly separable, that is, there is a 
hyperplane which correctly classifies the training data. Such a classifier is 
called a hard margin classifier. If the data is not linearly separable, then 
(7.4) does not have a solution. To deal with this situation we introduce 


non-negative slack variables $\xi_i$ to relax the constraints:

$$y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i.$$


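Given any $w$ and $b$ the constraints can now be satisfied by making $\xi_i$ large
enough. This renders the whole optimization problem useless. Therefore, one
has to penalize large $\xi_i$. This is done via the following modified optimization
problem:

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^m \xi_i \tag{7.5a}$$
$$\text{s.t. } y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 \text{ for all } i, \tag{7.5b}$$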


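where $C > 0$ is a penalty parameter. The resultant classifier is said to be a
soft margin classifier. By introducing non-negative Lagrange multipliers $\alpha_i$
and $\beta_i$ one can write the Lagrangian (see Section 3.3)

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^m \xi_i + \sum_{i=1}^m \alpha_i\bigl(1 - \xi_i - y_i(\langle w, x_i\rangle + b)\bigr) - \sum_{i=1}^m \beta_i \xi_i.$$

Next take gradients with respect to $w$, $b$ and $\xi$ and set them to zero:

$$\nabla_w L = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \tag{7.6a}$$
$$\nabla_b L = -\sum_{i=1}^m \alpha_i y_i = 0 \tag{7.6b}$$
$$\nabla_{\xi_i} L = \frac{C}{m} - \alpha_i - \beta_i = 0. \tag{7.6c}$$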


Substituting (7.6) into the Lagrangian and simplifying yields the dual
objective function:

$$-\frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle + \sum_{i=1}^m \alpha_i, \tag{7.7}$$

which needs to be maximized with respect to $\alpha$. For notational convenience
we will minimize the negative of (7.7) below. Next we turn our attention
to the dual constraints. Recall that $\alpha_i \ge 0$ and $\beta_i \ge 0$, which in conjunction
with (7.6c) immediately yields $0 \le \alpha_i \le \frac{C}{m}$. Furthermore, by (7.6b)
$\sum_{i=1}^m \alpha_i y_i = 0$. Putting everything together, the dual optimization problem
boils down to

$$\min_{\alpha} \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle - \sum_{i=1}^m \alpha_i \tag{7.8a}$$
$$\text{s.t. } \sum_{i=1}^m \alpha_i y_i = 0 \tag{7.8b}$$
$$0 \le \alpha_i \le \frac{C}{m}. \tag{7.8c}$$


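If we let $H$ be an $m \times m$ matrix with entries $H_{ij} = y_i y_j \langle x_i, x_j\rangle$, while $e$, $\alpha$,
and $y$ be $m$-dimensional vectors whose $i$-th components are one, $\alpha_i$, and $y_i$
respectively, then the above dual can be compactly written as the following
Quadratic Program (QP) (Section 3.3.3):

$$\min_{\alpha} \frac{1}{2}\alpha^\top H \alpha - \alpha^\top e \tag{7.9a}$$
$$\text{s.t. } \alpha^\top y = 0 \tag{7.9b}$$
$$0 \le \alpha_i \le \frac{C}{m}. \tag{7.9c}$$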


Before turning our attention to algorithms for solving (7.9), a number of
observations are in order. First, note that computing $H$ only requires computing
dot products between training examples. If we map the input data to
a Reproducing Kernel Hilbert Space (RKHS) via a feature map $\phi$, then we
can still compute the entries of $H$ and solve for the optimal $\alpha$. In this case,
$H_{ij} = y_i y_j \langle \phi(x_i), \phi(x_j)\rangle = y_i y_j k(x_i, x_j)$, where $k$ is the kernel associated
with the RKHS. Given the optimal $\alpha$, one can easily recover the decision
boundary. This is a direct consequence of (7.6a), which allows us to write $w$
as a linear combination of the training data:

$$w = \sum_{i=1}^m \alpha_i y_i \phi(x_i),$$

and hence the decision boundary as

$$\langle w, \phi(x)\rangle + b = \sum_{i=1}^m \alpha_i y_i k(x_i, x) + b. \tag{7.10}$$

By the KKT conditions (Section 3.3) we have

$$\alpha_i\bigl(1 - \xi_i - y_i(\langle w, x_i\rangle + b)\bigr) = 0 \text{ and } \beta_i \xi_i = 0.$$


We now consider three cases for $y_i(\langle w, x_i\rangle + b)$ and the implications of the
KKT conditions (see Figure 7.2).

Fig. 7.2. The picture depicts the well classified points ($y_i(\langle w, x_i\rangle + b) > 1$) in black,
the support vectors ($y_i(\langle w, x_i\rangle + b) = 1$) in blue, and margin errors ($y_i(\langle w, x_i\rangle + b) < 1$)
in red.

$y_i(\langle w, x_i\rangle + b) < 1$: In this case, $\xi_i > 0$, and hence the KKT conditions
imply that $\beta_i = 0$. Consequently, $\alpha_i = \frac{C}{m}$ (see (7.6c)). Such points
are said to be margin errors.

$y_i(\langle w, x_i\rangle + b) > 1$: In this case, $\xi_i = 0$, $(1 - \xi_i - y_i(\langle w, x_i\rangle + b)) < 0$, and by
the KKT conditions $\alpha_i = 0$. Such points are said to be well classified.
It is easy to see that the decision boundary (7.10) does not change
even if these points are removed from the training set.

$y_i(\langle w, x_i\rangle + b) = 1$: In this case $\xi_i = 0$ and $\beta_i \ge 0$. Since $\alpha_i$ is non-negative
and satisfies (7.6c) it follows that $0 \le \alpha_i \le \frac{C}{m}$. Such points are said
to be on the margin. They are also sometimes called support vectors.

Since the support vectors satisfy $y_i(\langle w, x_i\rangle + b) = 1$ and $y_i \in \{\pm 1\}$ it follows
that $b = y_i - \langle w, x_i\rangle$ for any support vector $x_i$. However, in practice to
recover $b$ we average

$$b = y_i - \langle w, x_i\rangle \tag{7.11}$$

over all support vectors, that is, points $x_i$ for which $0 < \alpha_i < \frac{C}{m}$. Because
it uses support vectors, the overall algorithm is called C-Support Vector
classifier or C-SV classifier for short.
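To make the recovery of the decision boundary concrete, the following sketch rebuilds $w$ from (7.6a) and averages (7.11) over the support vectors on the margin. It assumes a linear kernel (so that $w$ can be stored explicitly) and a dual solution `alpha` that has already been computed; the function names, the tolerance `tol`, and the two-point data set are illustrative assumptions.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def recover_primal(alpha, X, y, upper, tol=1e-9):
    """Recover (w, b) from dual coefficients, assuming a linear kernel.

    w follows (7.6a); b averages y_i - <w, x_i> over the points with
    0 < alpha_i < upper, where upper plays the role of C/m.
    """
    dim = len(X[0])
    w = [sum(a * yi * xi[k] for a, yi, xi in zip(alpha, y, X)) for k in range(dim)]
    on_margin = [i for i, a in enumerate(alpha) if tol < a < upper - tol]
    if not on_margin:
        raise ValueError("no coefficient strictly inside (0, C/m)")
    b = sum(y[i] - dot(w, X[i]) for i in on_margin) / len(on_margin)
    return w, b

def decide(w, b, x):
    """Predicted label: the sign of <w, x> + b."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical toy problem: one point per class, with C/m = 5.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1, -1]
alpha = [0.5, 0.5]   # minimizer of (7.9) for this particular toy problem
w, b = recover_primal(alpha, X, y, upper=5.0)
print(w, b, decide(w, b, (3.0, 2.0)), decide(w, b, (-0.5, 1.0)))
```

With a non-linear kernel one would not form $w$ at all; the same average would instead be computed with $\langle w, x_i\rangle$ replaced by the kernel expansion in (7.10).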
7.1.1 A Regularized Risk Minimization Viewpoint


A closer examination of (7.5) reveals that $\xi_i = 0$ whenever $y_i(\langle w, x_i\rangle + b) \ge 1$.
On the other hand, $\xi_i = 1 - y_i(\langle w, x_i\rangle + b)$ whenever $y_i(\langle w, x_i\rangle + b) < 1$.
In short, $\xi_i = \max(0, 1 - y_i(\langle w, x_i\rangle + b))$. Using this observation one
can eliminate $\xi_i$ from (7.5), and write it as the following unconstrained
optimization problem:

$$\min_{w,b} \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^m \max(0, 1 - y_i(\langle w, x_i\rangle + b)). \tag{7.12}$$

Writing (7.5) as (7.12) is particularly revealing because it shows that a
support vector classifier is nothing but a regularized risk minimizer. Here
the regularizer is the squared norm of the decision hyperplane, $\frac{1}{2}\|w\|^2$, and
the loss function is the so-called binary hinge loss (Figure 7.3):

$$l(w, x, y) = \max(0, 1 - y(\langle w, x\rangle + b)). \tag{7.13}$$

It is easy to verify that the binary hinge loss (7.13) is convex but
non-differentiable (see Figure 7.3), which renders the overall objective function
(7.12) convex but non-smooth. There are two different strategies to
minimize such an objective function. If minimizing (7.12) in the primal, one
can employ non-smooth convex optimizers such as bundle methods (Section
3.2.7). This yields a $d$ dimensional problem where $d$ is the dimension of $x$.
On the other hand, since (7.12) is strongly convex because of the presence
of the $\frac{1}{2}\|w\|^2$ term, its Fenchel dual has a Lipschitz continuous gradient
(see Lemma 3.10). The dual problem is $m$ dimensional and contains linear
constraints. This strategy is particularly attractive when the kernel trick is
used or whenever $d \gg m$. In fact, the dual problem obtained via Fenchel
duality is closely related to the quadratic programming problem (7.9) obtained
via Lagrange duality (Problem 7.4).
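The primal objective (7.12) and one of its subgradients are easy to write down explicitly. The sketch below is an illustration only: it evaluates the regularized risk and a subgradient with respect to $(w, b)$, which is the basic ingredient a non-smooth primal solver needs. It is not the bundle method referred to above, and the function names and data are assumptions made for the example.

```python
def dot(u, v):
    """Inner product <u, v> of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def hinge(w, b, x, y):
    """Binary hinge loss (7.13)."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def objective(w, b, X, Y, C):
    """Regularized risk (7.12)."""
    m = len(X)
    return 0.5 * dot(w, w) + (C / m) * sum(hinge(w, b, x, y) for x, y in zip(X, Y))

def subgradient(w, b, X, Y, C):
    """One subgradient of (7.12) with respect to (w, b).

    Points with y(<w, x> + b) < 1 contribute -(C/m) y x and -(C/m) y;
    at the kink we pick the zero element of the subdifferential.
    """
    m = len(X)
    g_w, g_b = list(w), 0.0
    for x, y in zip(X, Y):
        if y * (dot(w, x) + b) < 1.0:
            g_w = [g - (C / m) * y * xk for g, xk in zip(g_w, x)]
            g_b -= (C / m) * y
    return g_w, g_b

# Hypothetical data: the third point violates the margin.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
Y = [1, -1, -1]
w, b, C = [1.0, 0.0], 0.0, 3.0
print(objective(w, b, X, Y, C))
print(subgradient(w, b, X, Y, C))
```

Only the margin-violating point contributes to the loss term and to the subgradient, which mirrors the observation that $\xi_i = 0$ whenever $y_i(\langle w, x_i\rangle + b) \ge 1$.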


Fig. 7.3. The binary hinge loss, plotted against $y(\langle w, x\rangle + b)$. Note that the loss
is convex but non-differentiable at the kink point. Furthermore, it increases linearly
as the distance from the decision hyperplane $y(\langle w, x\rangle + b)$ decreases.


7.1.2 An Exponential Family Interpretation


Our motivating arguments for deriving the SVM algorithm have largely
been geometric. We now show that an equally elegant probabilistic interpretation
also exists. Assuming that the training set $\{(x_1, y_1), \ldots, (x_m, y_m)\}$
was drawn iid from some underlying distribution, and using the Bayes rule
(1.15) one can write the likelihood

$$p(\theta | X, Y) \propto p(\theta) p(Y | X, \theta) = p(\theta) \prod_{i=1}^m p(y_i | x_i, \theta), \tag{7.14}$$

and hence the negative log-likelihood

$$-\log p(\theta | X, Y) = -\sum_{i=1}^m \log p(y_i | x_i, \theta) - \log p(\theta) + \text{const.} \tag{7.15}$$


In the absence of any prior knowledge about the data, we choose a zero
mean unit variance isotropic normal distribution for $p(\theta)$. This yields

$$-\log p(\theta | X, Y) = \frac{1}{2}\|\theta\|^2 - \sum_{i=1}^m \log p(y_i | x_i, \theta) + \text{const.} \tag{7.16}$$

The maximum a posteriori (MAP) estimate for $\theta$ is obtained by minimizing
(7.16) with respect to $\theta$. Given the optimal $\theta$, we can predict the class label
at any given $x$ via

$$y^* = \operatorname*{argmax}_{y} p(y | x, \theta). \tag{7.17}$$


Of course, our aim is not just to maximize $p(y_i | x_i, \theta)$ but also to ensure
that $p(y | x_i, \theta)$ is small for all $y \ne y_i$. This, for instance, can be achieved by
requiring

$$\frac{p(y_i | x_i, \theta)}{p(y | x_i, \theta)} \ge \eta, \text{ for all } y \ne y_i \text{ and some } \eta \ge 1. \tag{7.18}$$

As we saw in Section 2.3 exponential families of distributions are rather
flexible modeling tools. We could, for instance, model $p(y_i | x_i, \theta)$ as a conditional
exponential family distribution. Recall the definition:

$$p(y | x, \theta) = \exp\bigl(\langle \phi(x, y), \theta\rangle - g(\theta | x)\bigr). \tag{7.19}$$

Here $\phi(x, y)$ is a joint feature map which depends on both the input data $x$
and the label $y$, while $g(\theta | x)$ is the log-partition function. Now (7.18) boils
down to

$$\frac{p(y_i | x_i, \theta)}{\max_{y \ne y_i} p(y | x_i, \theta)} = \exp\Bigl(\langle \phi(x_i, y_i), \theta\rangle - \max_{y \ne y_i} \langle \phi(x_i, y), \theta\rangle\Bigr) \ge \eta. \tag{7.20}$$

If we choose $\eta$ such that $\log \eta = 1$, set $\phi(x, y) = \frac{y}{2}\phi(x)$, and observe that
$y \in \{\pm 1\}$ we can rewrite (7.20) as

$$\Bigl\langle \frac{y_i}{2}\phi(x_i) - \Bigl(-\frac{y_i}{2}\Bigr)\phi(x_i), \theta\Bigr\rangle = y_i \langle \phi(x_i), \theta\rangle \ge 1. \tag{7.21}$$


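By replacing $-\log p(y_i | x_i, \theta)$ in (7.16) with the condition (7.21) we obtain
the following objective function:

$$\min_{\theta} \frac{1}{2}\|\theta\|^2 \tag{7.22a}$$
$$\text{s.t. } y_i \langle \phi(x_i), \theta\rangle \ge 1 \text{ for all } i, \tag{7.22b}$$

which recovers (7.4), but without the bias $b$. The prediction function is
recovered by noting that (7.17) specializes to

$$y^* = \operatorname*{argmax}_{y \in \{\pm 1\}} \langle \phi(x, y), \theta\rangle = \operatorname*{argmax}_{y \in \{\pm 1\}} \frac{y}{2}\langle \phi(x), \theta\rangle = \operatorname{sign}(\langle \phi(x), \theta\rangle). \tag{7.23}$$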


As before, we can replace (7.21) by a linear penalty for constraint violation
in order to recover (7.5). The quantity
$\log \frac{p(y_i | x_i, \theta)}{\max_{y \ne y_i} p(y | x_i, \theta)}$ is sometimes
called the log-odds ratio, and the above discussion shows that SVMs can
be interpreted as maximizing the log-odds ratio in the exponential family.
This interpretation will be developed further when we consider extensions of
SVMs to tackle multiclass, multilabel, and structured prediction problems.


7.1.3 Specialized Algorithms for Training SVMs 


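The main task in training SVMs boils down to solving (7.9). The $m \times m$
matrix $H$ is usually dense and, for large $m$, cannot be stored in memory.
Decomposition methods are designed to overcome these difficulties. The basic idea here
is to identify and update a small working set $B$ by solving a small
sub-problem at every iteration. Formally, let $B \subset \{1, \ldots, m\}$ be the working set
and $\alpha_B$ be the corresponding sub-vector of $\alpha$. Define $\bar{B} = \{1, \ldots, m\} \setminus B$
and $\alpha_{\bar{B}}$ analogously. In order to update $\alpha_B$ we need to solve the following
sub-problem of (7.9) obtained by freezing $\alpha_{\bar{B}}$:

$$\min_{\alpha_B} \frac{1}{2}\begin{bmatrix} \alpha_B^\top & \alpha_{\bar{B}}^\top \end{bmatrix} \begin{bmatrix} H_{BB} & H_{B\bar{B}} \\ H_{\bar{B}B} & H_{\bar{B}\bar{B}} \end{bmatrix} \begin{bmatrix} \alpha_B \\ \alpha_{\bar{B}} \end{bmatrix} - \begin{bmatrix} \alpha_B^\top & \alpha_{\bar{B}}^\top \end{bmatrix} e \tag{7.24a}$$
$$\text{s.t. } \begin{bmatrix} \alpha_B^\top & \alpha_{\bar{B}}^\top \end{bmatrix} y = 0 \tag{7.24b}$$
$$0 \le \alpha_i \le \frac{C}{m} \text{ for all } i \in B. \tag{7.24c}$$

Here, $\begin{bmatrix} H_{BB} & H_{B\bar{B}} \\ H_{\bar{B}B} & H_{\bar{B}\bar{B}} \end{bmatrix}$ is a permutation of the matrix $H$. By eliminating
constant terms and rearranging, one can simplify the above problem to

$$\min_{\alpha_B} \frac{1}{2}\alpha_B^\top H_{BB} \alpha_B + \alpha_B^\top (H_{B\bar{B}} \alpha_{\bar{B}} - e) \tag{7.25a}$$
$$\text{s.t. } \alpha_B^\top y_B = -\alpha_{\bar{B}}^\top y_{\bar{B}} \tag{7.25b}$$
$$0 \le \alpha_i \le \frac{C}{m} \text{ for all } i \in B. \tag{7.25c}$$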


An extreme case of a decomposition method is the Sequential Minimal
Optimization (SMO) algorithm of Platt [Pla99], which updates only two
coefficients per iteration. The advantage of this strategy, as we will see below, is
that the resultant sub-problem can be solved analytically. Without loss of
generality let $B = \{i, j\}$, and define $s = y_i / y_j$, $[c_i \; c_j] = (H_{B\bar{B}} \alpha_{\bar{B}} - e)^\top$
and $d = (-\alpha_{\bar{B}}^\top y_{\bar{B}} / y_j)$. Then (7.25) specializes to

$$\min_{\alpha_i, \alpha_j} \frac{1}{2}\bigl(H_{ii}\alpha_i^2 + H_{jj}\alpha_j^2 + 2H_{ij}\alpha_j\alpha_i\bigr) + c_i\alpha_i + c_j\alpha_j \tag{7.26a}$$
$$\text{s.t. } s\alpha_i + \alpha_j = d \tag{7.26b}$$
$$0 \le \alpha_i, \alpha_j \le \frac{C}{m}. \tag{7.26c}$$

This QP in two variables has an analytic solution.


Lemma 7.1 (Analytic solution of 2 variable QP) Define bounds

$$L = \begin{cases} \max\bigl(0, \frac{d - C/m}{s}\bigr) & \text{if } s > 0 \\ \max\bigl(0, \frac{d}{s}\bigr) & \text{otherwise} \end{cases} \tag{7.27}$$

$$H = \begin{cases} \min\bigl(\frac{C}{m}, \frac{d}{s}\bigr) & \text{if } s > 0 \\ \min\bigl(\frac{C}{m}, \frac{d - C/m}{s}\bigr) & \text{otherwise,} \end{cases} \tag{7.28}$$

and auxiliary variables

$$\chi = (H_{ii} + H_{jj}s^2 - 2sH_{ij}) \text{ and} \tag{7.29}$$
$$\rho = (c_j s - c_i - H_{ij}d + H_{jj}ds). \tag{7.30}$$

The optimal value of (7.26) can be computed analytically as follows: If $\chi = 0$
then

$$\alpha_i = \begin{cases} L & \text{if } \rho < 0 \\ H & \text{otherwise.} \end{cases}$$

If $\chi > 0$, then $\alpha_i = \max(L, \min(H, \rho/\chi))$. In both cases, $\alpha_j = (d - s\alpha_i)$.


Proof Eliminate the equality constraint by setting $\alpha_j = (d - s\alpha_i)$. Due to
the constraint $0 \le \alpha_j \le \frac{C}{m}$ it follows that $s\alpha_i = d - \alpha_j$ can be bounded
via $d - \frac{C}{m} \le s\alpha_i \le d$. Combining this with $0 \le \alpha_i \le \frac{C}{m}$ one can write
$L \le \alpha_i \le H$ where $L$ and $H$ are given by (7.27) and (7.28) respectively.
Substituting $\alpha_j = (d - s\alpha_i)$ into the objective function, dropping the terms
which do not depend on $\alpha_i$, and simplifying by substituting $\chi$ and $\rho$ yields
the following optimization problem in $\alpha_i$:

$$\min_{\alpha_i} \frac{1}{2}\alpha_i^2 \chi - \alpha_i \rho \quad \text{s.t. } L \le \alpha_i \le H.$$

First consider the case when $\chi = 0$. In this case, $\alpha_i = L$ if $\rho < 0$, otherwise
$\alpha_i = H$. On the other hand, if $\chi > 0$ then the unconstrained optimum of the
above optimization problem is given by $\rho/\chi$. The constrained optimum is
obtained by clipping appropriately: $\max(L, \min(H, \rho/\chi))$. This concludes
the proof. □
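Lemma 7.1 translates almost line by line into code. The following is a minimal sketch of the two-variable update, assuming the quantities $H_{ii}$, $H_{jj}$, $H_{ij}$, $c_i$, $c_j$, $s$, $d$ and the box bound $C/m$ (called `upper` here) have already been computed; the function name and the example numbers are illustrative.

```python
def solve_two_variable_qp(H_ii, H_jj, H_ij, c_i, c_j, s, d, upper):
    """Analytic minimizer of (7.26) following Lemma 7.1.

    `upper` stands for C/m and `s` is +1 or -1.
    Returns the pair (alpha_i, alpha_j).
    """
    # Bounds (7.27) and (7.28) on alpha_i.
    if s > 0:
        low, high = max(0.0, (d - upper) / s), min(upper, d / s)
    else:
        low, high = max(0.0, d / s), min(upper, (d - upper) / s)
    # Auxiliary variables (7.29) and (7.30).
    chi = H_ii + H_jj * s * s - 2.0 * s * H_ij
    rho = c_j * s - c_i - H_ij * d + H_jj * d * s
    if chi > 0.0:
        alpha_i = max(low, min(high, rho / chi))   # clipped unconstrained optimum
    else:
        alpha_i = low if rho < 0.0 else high       # linear objective: pick an end point
    return alpha_i, d - s * alpha_i

# Hypothetical sub-problem: two points x_1 = (1, 0), x_2 = (-1, 0) with labels
# y = (+1, -1), so H_ii = H_jj = H_ij = 1, s = -1, d = 0, c_i = c_j = -1.
print(solve_two_variable_qp(1.0, 1.0, 1.0, -1.0, -1.0, s=-1.0, d=0.0, upper=5.0))
```

Because the update is closed form, each SMO iteration costs only the work needed to obtain the inputs, which is what makes the two-variable working set attractive.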


To complete the description of SMO we need a valid stopping criterion as
well as a scheme for selecting the working set at every iteration. In order
to derive a stopping criterion we will use the KKT gap, that is, the extent
to which the KKT conditions are violated. Towards this end introduce
Lagrange multipliers $b \in \mathbb{R}$, $\lambda \in \mathbb{R}^m$ and $\mu \in \mathbb{R}^m$, with $\lambda$ and $\mu$
non-negative, and write the Lagrangian of (7.9):

$$L(\alpha, b, \lambda, \mu) = \frac{1}{2}\alpha^\top H \alpha - \alpha^\top e + b\,\alpha^\top y - \lambda^\top \alpha + \mu^\top\Bigl(\alpha - \frac{C}{m}e\Bigr). \tag{7.31}$$

If we let $J(\alpha) = \frac{1}{2}\alpha^\top H \alpha - \alpha^\top e$ be the objective function and
$\nabla J(\alpha) = H\alpha - e$ its gradient, then taking the gradient of the Lagrangian with respect to
$\alpha$ and setting it to 0 shows that

$$\nabla J(\alpha) + by = \lambda - \mu. \tag{7.32}$$




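Furthermore, by the KKT conditions we have

$$\lambda_i \alpha_i = 0 \text{ and } \mu_i\Bigl(\frac{C}{m} - \alpha_i\Bigr) = 0, \tag{7.33}$$

with $\lambda_i \ge 0$ and $\mu_i \ge 0$. Equations (7.32) and (7.33) can be compactly
rewritten as

$$\nabla J(\alpha)_i + by_i \ge 0 \text{ if } \alpha_i = 0 \tag{7.34a}$$
$$\nabla J(\alpha)_i + by_i \le 0 \text{ if } \alpha_i = \frac{C}{m} \tag{7.34b}$$
$$\nabla J(\alpha)_i + by_i = 0 \text{ if } 0 < \alpha_i < \frac{C}{m}. \tag{7.34c}$$

Since $y_i \in \{\pm 1\}$, we can further rewrite (7.34) as

$$-y_i \nabla J(\alpha)_i \le b \text{ for all } i \in I_{\text{up}}$$
$$-y_i \nabla J(\alpha)_i \ge b \text{ for all } i \in I_{\text{down}},$$

where the index sets $I_{\text{up}}$ and $I_{\text{down}}$ are defined as

$$I_{\text{up}} = \Bigl\{i : \alpha_i < \frac{C}{m}, y_i = 1 \text{ or } \alpha_i > 0, y_i = -1\Bigr\} \tag{7.35a}$$
$$I_{\text{down}} = \Bigl\{i : \alpha_i < \frac{C}{m}, y_i = -1 \text{ or } \alpha_i > 0, y_i = 1\Bigr\}. \tag{7.35b}$$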


In summary, the KKT conditions imply that $\alpha$ is a solution of (7.9) if and
only if

$$m(\alpha) \le M(\alpha)$$

where

$$m(\alpha) = \max_{i \in I_{\text{up}}} -y_i \nabla J(\alpha)_i \text{ and } M(\alpha) = \min_{i \in I_{\text{down}}} -y_i \nabla J(\alpha)_i. \tag{7.36}$$

Therefore, a natural stopping criterion is to stop when the KKT gap falls
below a desired tolerance $\epsilon$, that is,

$$m(\alpha) \le M(\alpha) + \epsilon. \tag{7.37}$$
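The stopping criterion (7.37) can be evaluated directly from the gradient $\nabla J(\alpha) = H\alpha - e$. The sketch below is illustrative: the function names are hypothetical, `upper` stands for $C/m$, and the small tolerance `tol` used to decide whether a coefficient sits at a bound is an assumption added for floating point robustness.

```python
def gradient(H, alpha):
    """Gradient of J(alpha) = 0.5 * alpha^T H alpha - alpha^T e, i.e. H alpha - e."""
    return [sum(h * a for h, a in zip(row, alpha)) - 1.0 for row in H]

def kkt_gap(alpha, y, grad, upper, tol=1e-12):
    """Return (m(alpha), M(alpha)) as defined in (7.36).

    Assumes both index sets (7.35) are non-empty.
    """
    score = [-yi * g for yi, g in zip(y, grad)]
    up = [s for s, a, yi in zip(score, alpha, y)
          if (a < upper - tol and yi > 0) or (a > tol and yi < 0)]
    down = [s for s, a, yi in zip(score, alpha, y)
            if (a < upper - tol and yi < 0) or (a > tol and yi > 0)]
    return max(up), min(down)

# Hypothetical two-point problem with H_ij = y_i y_j <x_i, x_j> = 1 for all i, j.
H = [[1.0, 1.0], [1.0, 1.0]]
y = [1, -1]
upper = 5.0
for alpha in ([0.0, 0.0], [0.5, 0.5]):
    m_val, M_val = kkt_gap(alpha, y, gradient(H, alpha), upper)
    print(alpha, m_val, M_val, m_val <= M_val + 1e-6)
```

At the starting point the gap is positive, so the criterion is not met; at the minimizer the two quantities coincide and the test passes.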


Finally, we turn our attention to the issue of working set selection. The
first order approximation to the objective function $J(\alpha)$ can be written as

$$J(\alpha + d) \approx J(\alpha) + \nabla J(\alpha)^\top d.$$

Since we are only interested in updating coefficients in the working set $B$
we set $d^\top = [d_B^\top \; 0]$, in which case we can rewrite the above first order
approximation as

$$\nabla J(\alpha)_B^\top d_B \approx J(\alpha + d) - J(\alpha).$$

From among all possible directions $d_B$ we wish to choose one which decreases
the objective function the most while maintaining feasibility. This is best
expressed as the following optimization problem:

$$\min_{d_B} \nabla J(\alpha)_B^\top d_B \tag{7.38a}$$
$$\text{s.t. } y_B^\top d_B = 0 \tag{7.38b}$$
$$d_i \ge 0 \text{ if } \alpha_i = 0 \text{ and } i \in B \tag{7.38c}$$
$$d_i \le 0 \text{ if } \alpha_i = \frac{C}{m} \text{ and } i \in B \tag{7.38d}$$
$$-1 \le d_i \le 1. \tag{7.38e}$$

Here (7.38b) comes from $y^\top(\alpha + d) = 0$ and $y^\top \alpha = 0$, while (7.38c) and
(7.38d) come from $0 \le \alpha_i \le \frac{C}{m}$. Finally, (7.38e) prevents the objective
function from diverging to $-\infty$. If we specialize (7.38) to SMO, we obtain

$$\min_{i,j} \nabla J(\alpha)_i d_i + \nabla J(\alpha)_j d_j \tag{7.39a}$$
$$\text{s.t. } y_i d_i + y_j d_j = 0 \tag{7.39b}$$
$$d_k \ge 0 \text{ if } \alpha_k = 0 \text{ and } k \in \{i, j\} \tag{7.39c}$$
$$d_k \le 0 \text{ if } \alpha_k = \frac{C}{m} \text{ and } k \in \{i, j\} \tag{7.39d}$$
$$-1 \le d_k \le 1 \text{ for } k \in \{i, j\}. \tag{7.39e}$$

At first glance, it seems that choosing the optimal $i$ and $j$ from the set
$\{1, \ldots, m\} \times \{1, \ldots, m\}$ requires $O(m^2)$ effort. We now show that $O(m)$ effort
suffices.

Define new variables $\hat{d}_k = y_k d_k$ for $k \in \{i, j\}$, and use the observation
$y_k \in \{\pm 1\}$ to rewrite the objective function as

$$\bigl(-y_i \nabla J(\alpha)_i + y_j \nabla J(\alpha)_j\bigr)\hat{d}_j.$$

Consider the case $-\nabla J(\alpha)_i y_i \ge -\nabla J(\alpha)_j y_j$. Because of the constraints
(7.39c) and (7.39d), if we choose $i \in I_{\text{up}}$ and $j \in I_{\text{down}}$, then $\hat{d}_j = -1$ and
$\hat{d}_i = 1$ is feasible and the objective function attains a negative value. For
all other choices of $i$ and $j$ ($i, j \in I_{\text{up}}$; $i, j \in I_{\text{down}}$; $i \in I_{\text{down}}$ and $j \in I_{\text{up}}$)
the objective function value of 0 is attained by setting $\hat{d}_i = \hat{d}_j = 0$. The
case $-\nabla J(\alpha)_j y_j \ge -\nabla J(\alpha)_i y_i$ is analogous. In summary, the optimization
problem (7.39) boils down to

$$\min_{i \in I_{\text{up}},\, j \in I_{\text{down}}} y_i \nabla J(\alpha)_i - y_j \nabla J(\alpha)_j = \min_{i \in I_{\text{up}}} y_i \nabla J(\alpha)_i - \max_{j \in I_{\text{down}}} y_j \nabla J(\alpha)_j,$$

which clearly can be solved in $O(m)$ time. Comparison with (7.36) shows
that at every iteration of SMO we choose to update coefficients $\alpha_i$ and $\alpha_j$
which maximally violate the KKT conditions.
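Putting the pieces of this section together gives a compact SMO loop: select the maximally violating pair in $O(m)$, stop when (7.37) holds, and otherwise apply the analytic update of Lemma 7.1. The code below is a didactic sketch under several assumptions that are not in the text: the full kernel matrix `K` is precomputed and held in memory, the start point is $\alpha = 0$, and the names `smo`, `solve_pair`, `eps`, `tol` and `max_iter` are hypothetical.

```python
def solve_pair(H_ii, H_jj, H_ij, c_i, c_j, s, d, upper):
    """Analytic solution of the two-variable QP (7.26), as in Lemma 7.1."""
    if s > 0:
        low, high = max(0.0, (d - upper) / s), min(upper, d / s)
    else:
        low, high = max(0.0, d / s), min(upper, (d - upper) / s)
    chi = H_ii + H_jj * s * s - 2.0 * s * H_ij
    rho = c_j * s - c_i - H_ij * d + H_jj * d * s
    if chi > 0.0:
        a_i = max(low, min(high, rho / chi))
    else:
        a_i = low if rho < 0.0 else high
    a_j = min(upper, max(0.0, d - s * a_i))   # clip away rounding error
    return a_i, a_j

def smo(K, y, C, eps=1e-6, max_iter=10000, tol=1e-12):
    """Minimize (7.9) for kernel matrix K and labels y in {+1, -1}."""
    m = len(y)
    upper = C / m
    H = [[y[i] * y[j] * K[i][j] for j in range(m)] for i in range(m)]
    alpha = [0.0] * m
    grad = [-1.0] * m                          # H alpha - e at alpha = 0
    for _ in range(max_iter):
        score = [-y[k] * grad[k] for k in range(m)]
        up = [k for k in range(m)
              if (alpha[k] < upper - tol and y[k] > 0) or (alpha[k] > tol and y[k] < 0)]
        down = [k for k in range(m)
                if (alpha[k] < upper - tol and y[k] < 0) or (alpha[k] > tol and y[k] > 0)]
        if not up or not down:
            break
        i = max(up, key=lambda k: score[k])    # attains m(alpha)
        j = min(down, key=lambda k: score[k])  # attains M(alpha)
        if score[i] <= score[j] + eps:         # stopping criterion (7.37)
            break
        s = y[i] / y[j]
        d = s * alpha[i] + alpha[j]            # keeps alpha^T y = 0
        # Linear coefficients of (7.26): gradient minus the working-set part.
        c_i = grad[i] - H[i][i] * alpha[i] - H[i][j] * alpha[j]
        c_j = grad[j] - H[j][i] * alpha[i] - H[j][j] * alpha[j]
        new_i, new_j = solve_pair(H[i][i], H[j][j], H[i][j], c_i, c_j, s, d, upper)
        step_i, step_j = new_i - alpha[i], new_j - alpha[j]
        for k in range(m):                     # rank-two gradient update
            grad[k] += H[k][i] * step_i + H[k][j] * step_j
        alpha[i], alpha[j] = new_i, new_j
    return alpha

# Hypothetical one-dimensional data with a linear kernel.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, -1, 1, 1]
K = [[a * b for b in xs] for a in xs]
alpha = smo(K, ys, C=40.0)
print(alpha, sum(a * yi * x for a, yi, x in zip(alpha, ys, xs)))
```

Production solvers differ mainly in that they cache kernel rows instead of storing `K`, and often use second order information when choosing `j`; neither refinement is shown here.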


7.2 Extensions 
7.2.1 The ν trick


In the soft margin formulation the parameter $C$ is a trade-off between two
conflicting requirements, namely maximizing the margin and minimizing the
training error. Unfortunately, this parameter is rather unintuitive and hence
difficult to tune. The ν-SVM was proposed to address this issue. As Theorem
7.3 below shows, $\nu$ controls the number of support vectors and margin errors.
The primal problem for the ν-SVM can be written as

$$\min_{w, b, \xi, \rho} \frac{1}{2}\|w\|^2 - \rho + \frac{1}{\nu m}\sum_{i=1}^m \xi_i \tag{7.40a}$$
$$\text{s.t. } y_i(\langle w, x_i\rangle + b) \ge \rho - \xi_i \text{ for all } i \tag{7.40b}$$
$$\xi_i \ge 0, \text{ and } \rho \ge 0. \tag{7.40c}$$

As before, if we write the Lagrangian by introducing non-negative Lagrange
multipliers, take gradients with respect to the primal variables and set them
to zero, and substitute the result back into the Lagrangian we obtain the
following dual:

$$\min_{\alpha} \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle \tag{7.41a}$$
$$\text{s.t. } \sum_{i=1}^m \alpha_i y_i = 0 \tag{7.41b}$$
$$\sum_{i=1}^m \alpha_i \ge 1 \tag{7.41c}$$
$$0 \le \alpha_i \le \frac{1}{\nu m}. \tag{7.41d}$$

It turns out that the dual can be further simplified via the following lemma.

Lemma 7.2 Let $\nu \in [0, 1]$ and (7.41) be feasible. Then there is at least one
solution $\alpha$ which satisfies $\sum_i \alpha_i = 1$. Furthermore, if the final objective value
of (7.41) is non-zero then all solutions satisfy $\sum_i \alpha_i = 1$.


Proof The feasible region of (7.41) is bounded, therefore if it is feasible
then there exists an optimal solution. Let $\alpha$ denote this solution and assume
that $\sum_i \alpha_i > 1$. In this case we can define

$$\bar{\alpha} = \frac{1}{\sum_j \alpha_j}\alpha,$$

and easily check that $\bar{\alpha}$ is also feasible. As before, let $H$ denote an $m \times m$
matrix with $H_{ij} = y_i y_j \langle x_i, x_j\rangle$. Since $\alpha$ is the optimal solution of (7.41) it
follows that

$$\frac{1}{2}\alpha^\top H \alpha \le \frac{1}{2}\bar{\alpha}^\top H \bar{\alpha} = \Bigl(\frac{1}{\sum_j \alpha_j}\Bigr)^2 \frac{1}{2}\alpha^\top H \alpha \le \frac{1}{2}\alpha^\top H \alpha.$$

This implies that either $\frac{1}{2}\alpha^\top H \alpha = 0$, in which case $\bar{\alpha}$ is an optimal solution
with the desired property, or $\frac{1}{2}\alpha^\top H \alpha \ne 0$, in which case all optimal solutions
satisfy $\sum_i \alpha_i = 1$. □


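In view of the above lemma one can equivalently replace (7.41) by the
following simplified optimization problem with two equality constraints

$$\min_{\alpha} \frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle \tag{7.42a}$$
$$\text{s.t. } \sum_{i=1}^m \alpha_i y_i = 0 \tag{7.42b}$$
$$\sum_{i=1}^m \alpha_i = 1 \tag{7.42c}$$
$$0 \le \alpha_i \le \frac{1}{\nu m}. \tag{7.42d}$$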


The following theorems, which we state without proof, explain the significance
of $\nu$ and the connection between ν-SVM and the soft margin formulation.


Theorem 7.3 Suppose we run ν-SVM with kernel $k$ on some data and
obtain $\rho > 0$. Then

(i) $\nu$ is an upper bound on the fraction of margin errors, that is points
for which $y_i(\langle w, x_i\rangle + b) < \rho$.

(ii) $\nu$ is a lower bound on the fraction of support vectors, that is points
for which $y_i(\langle w, x_i\rangle + b) = \rho$.

(iii) Suppose the data $(X, Y)$ were generated iid from a distribution $p(x, y)$
such that neither $p(x, y = +1)$ nor $p(x, y = -1)$ contain any discrete
components. Moreover, assume that the kernel $k$ is analytic and
non-constant. With probability 1, asymptotically, $\nu$ equals both the fraction
of support vectors and fraction of margin errors.


Theorem 7.4 If (7.40) leads to a decision function with $\rho > 0$, then (7.5)
with $C = \frac{1}{\rho}$ leads to the same decision function.


7.2.2 Squared Hinge Loss 


In binary classification, the actual loss which one would like to minimize is
the so-called 0-1 loss

$$l(w, x, y) = \begin{cases} 0 & \text{if } y(\langle w, x\rangle + b) \ge 1 \\ 1 & \text{otherwise.} \end{cases} \tag{7.43}$$

This loss is difficult to work with because it is non-convex (see Figure 7.4). In
fact, it has been shown that finding the optimal $(w, b)$ pair which minimizes
the 0-1 loss on a training dataset of $m$ labeled points is NP hard [BDEL03].
Therefore various proxy functions such as the binary hinge loss (7.13) which
we discussed in Section 7.1.1 are used. Another popular proxy is the square
hinge loss:

$$l(w, x, y) = \max(0, 1 - y(\langle w, x\rangle + b))^2. \tag{7.44}$$

Fig. 7.4. Loss plotted against $y(\langle w, x\rangle + b)$. The 0-1 loss, which is non-convex and
intractable, is depicted in red. The hinge loss is a convex upper bound to the 0-1 loss
and shown in blue. The square hinge loss is a differentiable convex upper bound to
the 0-1 loss and is depicted in green.


Besides being a proxy for the 0-1 loss, the squared hinge loss, unlike the 
hinge loss, is also differentiable everywhere. This sometimes makes the opti- 
mization in the primal easier. Just like in the case of the hinge loss one can 
derive the dual of the regularized risk minimization problem and show that 
it is a quadratic programming problem (problem 7.5). 
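The difference in smoothness between the two proxies can be seen numerically at the kink $y(\langle w, x\rangle + b) = 1$. The sketch below is illustrative only: it writes both losses as functions of the scalar $z = y(\langle w, x\rangle + b)$ and compares one-sided finite-difference slopes; the helper names and the step size `h` are assumptions.

```python
def hinge(z):
    """Binary hinge loss (7.13) as a function of z = y(<w, x> + b)."""
    return max(0.0, 1.0 - z)

def squared_hinge(z):
    """Square hinge loss (7.44) as a function of z = y(<w, x> + b)."""
    return max(0.0, 1.0 - z) ** 2

def one_sided_slopes(loss, z, h=1e-6):
    """Finite-difference slopes of `loss` just left and just right of z."""
    left = (loss(z) - loss(z - h)) / h
    right = (loss(z + h) - loss(z)) / h
    return left, right

print(one_sided_slopes(hinge, 1.0))           # slopes disagree: a kink
print(one_sided_slopes(squared_hinge, 1.0))   # slopes agree up to O(h)
```

For the hinge loss the left slope is close to $-1$ while the right slope is $0$; for the square hinge loss both are close to $0$, which is why gradient-based primal solvers can be applied to it without resorting to subgradients.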


7.2.3 Ramp Loss


The ramp loss

$$l(w, x, y) = \min(1 - s, \max(0, 1 - y(\langle w, x\rangle + b))) \tag{7.45}$$

parameterized by $s \le 0$ is another proxy for the 0-1 loss (see Figure 7.5).
Although not convex, it can be expressed as the difference of two convex
functions

$$l_{\text{conc}}(w, x, y) = \max(0, 1 - y(\langle w, x\rangle + b)) \text{ and}$$
$$l_{\text{cave}}(w, x, y) = \max(0, s - y(\langle w, x\rangle + b)).$$

Fig. 7.5. The ramp loss depicted here with $s = -0.3$ can be viewed as the sum
of a convex function, namely the binary hinge loss (left), and a concave function
$\min(0, y(\langle w, x\rangle + b) - s)$ (right). Viewed alternatively, the ramp loss can be written
as the difference of two convex functions.

Therefore the Convex-Concave procedure (CCP) we discussed in Section
3.5.1 can be used to solve the resulting regularized risk minimization problem
with the ramp loss. Towards this end write


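$$J(w) = \underbrace{\frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^m l_{\text{conc}}(w, x_i, y_i)}_{J_{\text{conc}}(w)} - \underbrace{\frac{C}{m}\sum_{i=1}^m l_{\text{cave}}(w, x_i, y_i)}_{J_{\text{cave}}(w)}. \tag{7.46}$$

Recall that at every iteration of the CCP we replace $J_{\text{cave}}(w)$ by its first
order Taylor approximation, computing which requires

$$\partial_w J_{\text{cave}}(w) = \frac{C}{m}\sum_{i=1}^m \partial_w l_{\text{cave}}(w, x_i, y_i). \tag{7.47}$$

This in turn can be computed as

$$\partial_w l_{\text{cave}}(w, x_i, y_i) = \delta_i y_i x_i \text{ with } \delta_i = \begin{cases} -1 & \text{if } s > y_i(\langle w, x_i\rangle + b) \\ 0 & \text{otherwise.} \end{cases} \tag{7.48}$$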

Ignoring constant terms, each iteration of the CCP algorithm involves solving
the following minimization problem (also see (3.134))

$$J(w) = \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^m l_{\text{conc}}(w, x_i, y_i) - \Bigl(\frac{C}{m}\sum_{i=1}^m \delta_i y_i x_i\Bigr)^\top w. \tag{7.49}$$

Let $\delta$ denote a vector in $\mathbb{R}^m$ with components $\delta_i$. Using the same notation
as in (7.9) we can write the following dual optimization problem, which is
very closely related to the standard SVM dual (7.9) (see Problem 7.6):

$$\min_{\alpha} \frac{1}{2}\alpha^\top H \alpha - \alpha^\top e \tag{7.50a}$$
$$\text{s.t. } \alpha^\top y = 0 \tag{7.50b}$$
$$-\frac{C}{m}\delta \le \alpha \le \frac{C}{m}(e - \delta). \tag{7.50c}$$

In fact, this problem can be solved by an SMO solver with minor modifications.
Putting everything together yields Algorithm 7.1.


Algorithm 7.1 CCP for Ramp Loss
1: Initialize $\delta^0$ and $\alpha^0$
2: repeat
3:   Solve (7.50) to find $\alpha^{t+1}$
4:   Compute $\delta^{t+1}$ using (7.48)
5: until $\delta^{t+1} = \delta^t$
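The decomposition that drives the CCP can be verified numerically. The sketch below checks that the ramp loss (7.45) equals the difference of the two convex pieces on a grid of margin values, and evaluates the indicator $\delta_i$ from (7.48). It works with the scalar $z = y(\langle w, x\rangle + b)$; the function names and the grid are illustrative assumptions.

```python
def hinge_part(z):
    """Convex piece l_conc: the binary hinge loss at margin value z."""
    return max(0.0, 1.0 - z)

def cave_part(z, s):
    """Convex piece l_cave that is subtracted: max(0, s - z)."""
    return max(0.0, s - z)

def ramp(z, s):
    """Ramp loss (7.45) at margin value z = y(<w, x> + b)."""
    return min(1.0 - s, max(0.0, 1.0 - z))

def delta(z, s):
    """Indicator from (7.48): -1 where l_cave is active, 0 elsewhere."""
    return -1.0 if s > z else 0.0

s = -0.3
grid = [k / 10.0 for k in range(-30, 31)]
max_gap = max(abs(ramp(z, s) - (hinge_part(z) - cave_part(z, s))) for z in grid)
print(max_gap)
print([delta(z, s) for z in (-1.0, -0.3, 0.5)])
```

Only points with margin value below $s$ receive a non-zero $\delta_i$, so each pass of Algorithm 7.1 changes the box constraints (7.50c) for exactly those points.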


7.3 Support Vector Regression 


As opposed to classification where the labels $y_i$ are binary valued, in
regression they are real valued. Given a tolerance $\epsilon$, our aim here is to find a
hyperplane parameterized by $(w, b)$ such that

$$|y_i - (\langle w, x_i\rangle + b)| \le \epsilon. \tag{7.51}$$

Fig. 7.6. The $\epsilon$ insensitive loss, plotted against $y - (\langle w, x\rangle + b)$. All points which lie
within the $\epsilon$ tube shaded in gray incur zero loss while points outside incur a linear loss.


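In other words, we want to find a hyperplane such that all the training data
lies within an $\epsilon$ tube around the hyperplane. We may not always be able to
find such a hyperplane, hence we relax the above condition by introducing
slack variables $\xi_i^+$ and $\xi_i^-$ and write the corresponding primal problem as

$$\min_{w, b, \xi^+, \xi^-} \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^m (\xi_i^+ + \xi_i^-) \tag{7.52a}$$
$$\text{s.t. } y_i - (\langle w, x_i\rangle + b) \le \epsilon + \xi_i^+ \text{ for all } i \tag{7.52b}$$
$$(\langle w, x_i\rangle + b) - y_i \le \epsilon + \xi_i^- \text{ for all } i \tag{7.52c}$$
$$\xi_i^+ \ge 0, \text{ and } \xi_i^- \ge 0. \tag{7.52d}$$

The Lagrangian can be written by introducing non-negative Lagrange
multipliers $\alpha_i^+$, $\alpha_i^-$, $\beta_i^+$ and $\beta_i^-$:

$$L(w, b, \xi^+, \xi^-, \alpha^+, \alpha^-, \beta^+, \beta^-) = \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^m (\xi_i^+ + \xi_i^-) - \sum_{i=1}^m (\beta_i^+ \xi_i^+ + \beta_i^- \xi_i^-)$$
$$\qquad + \sum_{i=1}^m \alpha_i^+ \bigl(y_i - (\langle w, x_i\rangle + b) - \epsilon - \xi_i^+\bigr) + \sum_{i=1}^m \alpha_i^- \bigl((\langle w, x_i\rangle + b) - y_i - \epsilon - \xi_i^-\bigr).$$

Taking gradients with respect to the primal variables and setting them to
0, we obtain the following conditions:

$$w = \sum_{i=1}^m (\alpha_i^+ - \alpha_i^-) x_i \tag{7.53}$$
$$\sum_{i=1}^m \alpha_i^+ = \sum_{i=1}^m \alpha_i^- \tag{7.54}$$
$$\alpha_i^+ + \beta_i^+ = \frac{C}{m} \tag{7.55}$$
$$\alpha_i^- + \beta_i^- = \frac{C}{m}. \tag{7.56}$$

Noting that $\alpha_i^{\{+,-\}}, \beta_i^{\{+,-\}} \ge 0$ and substituting the above conditions into
the Lagrangian yields the dual

$$\min_{\alpha^+, \alpha^-} \frac{1}{2}\sum_{i,j} (\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-)\langle x_i, x_j\rangle + \epsilon\sum_{i=1}^m (\alpha_i^+ + \alpha_i^-) - \sum_{i=1}^m y_i(\alpha_i^+ - \alpha_i^-) \tag{7.57a}$$
$$\text{s.t. } \sum_{i=1}^m \alpha_i^+ = \sum_{i=1}^m \alpha_i^- \tag{7.57b}$$
$$0 \le \alpha_i^+ \le \frac{C}{m} \tag{7.57c}$$
$$0 \le \alpha_i^- \le \frac{C}{m}. \tag{7.57d}$$

This is a quadratic programming problem with one equality constraint, and
hence an SMO-like decomposition method can be derived for finding the
optimal coefficients $\alpha^+$ and $\alpha^-$ (Problem 7.7).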

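As a consequence of (7.53), analogous to the classification case, one can
map the data via a feature map $\phi$ into an RKHS with kernel $k$ and recover
the decision boundary $f(x) = \langle w, \phi(x)\rangle + b$ via

$$f(x) = \sum_{i=1}^m (\alpha_i^+ - \alpha_i^-)\langle \phi(x_i), \phi(x)\rangle + b = \sum_{i=1}^m (\alpha_i^+ - \alpha_i^-) k(x_i, x) + b. \tag{7.58}$$

Finally, the KKT conditions

$$\Bigl(\frac{C}{m} - \alpha_i^+\Bigr)\xi_i^+ = 0, \quad \Bigl(\frac{C}{m} - \alpha_i^-\Bigr)\xi_i^- = 0 \quad \text{and}$$
$$\alpha_i^-\bigl((\langle w, x_i\rangle + b) - y_i - \epsilon - \xi_i^-\bigr) = 0, \quad \alpha_i^+\bigl(y_i - (\langle w, x_i\rangle + b) - \epsilon - \xi_i^+\bigr) = 0,$$

allow us to draw many useful conclusions:

• Whenever $|y_i - (\langle w, x_i\rangle + b)| < \epsilon$, this implies that $\xi_i^+ = \xi_i^- = \alpha_i^+ = \alpha_i^- = 0$.
In other words, points which lie inside the $\epsilon$ tube around the
hyperplane $\langle w, x\rangle + b$ do not contribute to the solution, thus leading to
sparse expansions in terms of $\alpha$.

• If $(\langle w, x_i\rangle + b) - y_i > \epsilon$ we have $\xi_i^- > 0$ and therefore $\alpha_i^- = \frac{C}{m}$. On the other
hand, $\xi_i^+ = 0$ and $\alpha_i^+ = 0$. The case $y_i - (\langle w, x_i\rangle + b) > \epsilon$ is symmetric
and yields $\xi_i^- = 0$, $\xi_i^+ > 0$, $\alpha_i^+ = \frac{C}{m}$ and $\alpha_i^- = 0$.

• Finally, if $(\langle w, x_i\rangle + b) - y_i = \epsilon$ we have $\xi_i^- = 0$ and $0 \le \alpha_i^- \le \frac{C}{m}$, while
$\xi_i^+ = 0$ and $\alpha_i^+ = 0$. Similarly, when $y_i - (\langle w, x_i\rangle + b) = \epsilon$ we obtain
$\xi_i^+ = 0$, $0 \le \alpha_i^+ \le \frac{C}{m}$, $\xi_i^- = 0$ and $\alpha_i^- = 0$.

Note that $\alpha_i^+$ and $\alpha_i^-$ are never simultaneously non-zero.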
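The kernel expansion (7.58) and the behaviour of the $\epsilon$ tube can be illustrated with a few lines of code. The sketch below assumes the dual coefficients are already known and uses a linear kernel by default; the function names and all numbers are hypothetical.

```python
def linear_kernel(u, v):
    """Default kernel k(u, v) = <u, v>, used only for illustration."""
    return sum(a * b for a, b in zip(u, v))

def svr_predict(alpha_plus, alpha_minus, X, b, x, kernel=linear_kernel):
    """Evaluate f(x) = sum_i (alpha_i^+ - alpha_i^-) k(x_i, x) + b, as in (7.58)."""
    return sum((ap - am) * kernel(xi, x)
               for ap, am, xi in zip(alpha_plus, alpha_minus, X)) + b

def eps_insensitive(y, fx, eps):
    """Loss max(0, |y - f(x)| - eps): zero inside the tube, linear outside."""
    return max(0.0, abs(y - fx) - eps)

# Hypothetical coefficients; they satisfy sum(alpha_plus) == sum(alpha_minus)
# and are never simultaneously non-zero for the same index.
X = [(1.0,), (2.0,)]
alpha_plus, alpha_minus = [0.5, 0.0], [0.0, 0.5]
fx = svr_predict(alpha_plus, alpha_minus, X, b=0.1, x=(3.0,))
print(fx)
print(eps_insensitive(-1.0, fx, eps=0.1))    # target outside the tube
print(eps_insensitive(-1.45, fx, eps=0.1))   # target inside the tube
```

Training points whose targets fall strictly inside the tube have $\alpha_i^+ = \alpha_i^- = 0$, so they drop out of the sum in `svr_predict`; this is the sparsity noted in the first bullet above.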


7.3.1 Incorporating General Loss Functions


Using the same reasoning as in Section 7.1.1 we can deduce from (7.52) that
the loss function of support vector regression is given by

$$l(w, x, y) = \max(0, |y - \langle w, x\rangle| - \epsilon). \tag{7.59}$$

It turns out that the support vector regression framework can be easily
extended to handle other, more general, convex loss functions such as the
ones found in Table 7.1. Different losses have different properties and hence
lead to different estimators. For instance, the square loss leads to penalized
least squares (LS) regression, while the Laplace loss leads to the penalized
least absolute deviations (LAD) estimator. Huber's loss on the other hand is
a combination of the penalized LS and LAD estimators, and the pinball loss
with parameter $\tau \in [0, 1]$ is used to estimate $\tau$-quantiles. Setting $\tau = 0.5$
in the pinball loss leads to a scaled version of the Laplace loss. If we define
$\xi = y - \langle w, x\rangle$, then it is easily verified that all these losses can be written
as

$$l(w, x, y) = \begin{cases} l^+(\xi - \epsilon) & \text{if } \xi > \epsilon \\ l^-(-\xi - \epsilon) & \text{if } \xi < -\epsilon \\ 0 & \text{if } \xi \in [-\epsilon, \epsilon]. \end{cases} \tag{7.60}$$


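For all these different loss functions, the support vector regression
formulation can be written in a unified fashion as follows

$$\min_{w, b, \xi^+, \xi^-} \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^m \bigl(l^+(\xi_i^+) + l^-(\xi_i^-)\bigr) \tag{7.61a}$$
$$\text{s.t. } y_i - (\langle w, x_i\rangle + b) \le \epsilon + \xi_i^+ \text{ for all } i \tag{7.61b}$$
$$(\langle w, x_i\rangle + b) - y_i \le \epsilon + \xi_i^- \text{ for all } i \tag{7.61c}$$
$$\xi_i^+ \ge 0, \text{ and } \xi_i^- \ge 0. \tag{7.61d}$$

The dual in this case is given by

$$\min_{\alpha^+, \alpha^-} \frac{1}{2}\sum_{i,j} (\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-)\langle x_i, x_j\rangle - \frac{C}{m}\sum_{i=1}^m \bigl(T^+(\xi_i^+) + T^-(\xi_i^-)\bigr) + \epsilon\sum_{i=1}^m (\alpha_i^+ + \alpha_i^-) - \sum_{i=1}^m y_i(\alpha_i^+ - \alpha_i^-) \tag{7.62a}$$
$$\text{s.t. } \sum_{i=1}^m \alpha_i^+ = \sum_{i=1}^m \alpha_i^- \tag{7.62b}$$
$$0 \le \alpha_i^{\{+,-\}} \le \frac{C}{m}\partial_\xi l^{\{+,-\}}(\xi_i^{\{+,-\}}) \tag{7.62c}$$
$$0 \le \xi_i^{\{+,-\}} \tag{7.62d}$$
$$\xi_i^{\{+,-\}} = \inf\Bigl\{\xi \Bigm| \frac{C}{m}\partial_\xi l^{\{+,-\}}(\xi) \ge \alpha_i^{\{+,-\}}\Bigr\}. \tag{7.62e}$$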


Table 7.1. Various loss functions which can be used in support vector
regression. For brevity we denote $y - \langle w, x\rangle$ as $\xi$ and write the loss
$l(w, x, y)$ in terms of $\xi$.

  $\epsilon$-insensitive loss    $\max(0, |\xi| - \epsilon)$
  Laplace loss           $|\xi|$
  Square loss            $\frac{1}{2}|\xi|^2$
  Huber's robust loss    $\frac{1}{2\sigma}\xi^2$ if $|\xi| \le \sigma$; $|\xi| - \frac{\sigma}{2}$ otherwise
  Pinball loss           $\tau\xi$ if $\xi \ge 0$; $(\tau - 1)\xi$ otherwise

Here $T^+(\xi) = l^+(\xi) - \xi\partial_\xi l^+(\xi)$ and $T^-(\xi) = l^-(\xi) - \xi\partial_\xi l^-(\xi)$. We now show
how (7.62) can be specialized to the pinball loss. Clearly, $l^+(\xi) = \tau\xi$ while
$l^-(-\xi) = (\tau - 1)\xi$, and hence $l^-(\xi) = (1 - \tau)\xi$. Therefore,
$T^+(\xi) = \tau\xi - \xi\tau = 0$. Similarly $T^-(\xi) = 0$. Since $\partial_\xi l^+(\xi) = \tau$ and $\partial_\xi l^-(\xi) = (1 - \tau)$
for all $\xi \ge 0$, it follows that the bounds on $\alpha^{\{+,-\}}$ can be computed as
$0 \le \alpha_i^+ \le \frac{C}{m}\tau$ and $0 \le \alpha_i^- \le \frac{C}{m}(1 - \tau)$. If we denote $\alpha = \alpha^+ - \alpha^-$ and
observe that $\epsilon = 0$ for the pinball loss then (7.62) specializes as follows:

$$\min_{\alpha} \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j\rangle - \sum_{i=1}^m y_i \alpha_i \tag{7.63a}$$
$$\text{s.t. } \sum_{i=1}^m \alpha_i = 0 \tag{7.63b}$$
$$\frac{C}{m}(\tau - 1) \le \alpha_i \le \frac{C}{m}\tau. \tag{7.63c}$$

Similar specializations of (7.62) for other loss functions in Table 7.1 can be 
derived. 
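The claim that the pinball loss estimates $\tau$-quantiles can be checked on a toy sample. The sketch below is an illustration rather than a regression solver: it ignores $w$ and the regularizer, and simply finds the constant prediction that minimizes the total pinball loss, searching over the sample values themselves. The function names and the data are assumptions.

```python
def pinball(xi, tau):
    """Pinball loss of the residual xi = y - prediction (Table 7.1)."""
    return tau * xi if xi >= 0 else (tau - 1.0) * xi

def best_constant(ys, tau):
    """Sample value c minimizing sum_i pinball(y_i - c, tau).

    The total loss is piecewise linear and convex in c, so a minimizer
    can be found among the sample values.
    """
    return min(ys, key=lambda c: sum(pinball(y - c, tau) for y in ys))

ys = [float(k) for k in range(11)]   # hypothetical sample 0, 1, ..., 10
print(best_constant(ys, 0.5))        # minimizer for tau = 0.5
print(best_constant(ys, 0.9))        # minimizer for tau = 0.9
print(pinball(-3.0, 0.5))            # tau = 0.5 gives half the absolute residual
```

Raising $\tau$ moves the minimizer towards the upper end of the sample, which is the asymmetry that the box constraint (7.63c) encodes in the dual.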


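7.3.2 Incorporating the ν Trick


One can also incorporate the ν trick into support vector regression. The
primal problem obtained after incorporating the ν trick can be written as

$$\min_{w, b, \xi^+, \xi^-, \epsilon} \frac{1}{2}\|w\|^2 + \epsilon + \frac{1}{\nu m}\sum_{i=1}^m (\xi_i^+ + \xi_i^-) \tag{7.64a}$$
$$\text{s.t. } (\langle w, x_i\rangle + b) - y_i \le \epsilon + \xi_i^+ \text{ for all } i \tag{7.64b}$$
$$y_i - (\langle w, x_i\rangle + b) \le \epsilon + \xi_i^- \text{ for all } i \tag{7.64c}$$
$$\xi_i^+ \ge 0, \; \xi_i^- \ge 0, \text{ and } \epsilon \ge 0. \tag{7.64d}$$

Proceeding as before we obtain the following simplified dual

$$\min_{\alpha^+, \alpha^-} \frac{1}{2}\sum_{i,j} (\alpha_i^- - \alpha_i^+)(\alpha_j^- - \alpha_j^+)\langle x_i, x_j\rangle - \sum_{i=1}^m y_i(\alpha_i^- - \alpha_i^+) \tag{7.65a}$$
$$\text{s.t. } \sum_{i=1}^m (\alpha_i^- - \alpha_i^+) = 0 \tag{7.65b}$$
$$\sum_{i=1}^m (\alpha_i^- + \alpha_i^+) = 1 \tag{7.65c}$$
$$0 \le \alpha_i^+ \le \frac{1}{\nu m} \tag{7.65d}$$
$$0 \le \alpha_i^- \le \frac{1}{\nu m}. \tag{7.65e}$$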


7.4 Novelty Detection 


The large margin approach can also be adapted to perform novelty detection
or quantile estimation. Novelty detection is an unsupervised task where one
is interested in flagging a small fraction of the input $X = \{x_1, \ldots, x_m\}$ as
atypical or novel. It can be viewed as a special case of the quantile estimation
task, where we are interested in estimating a simple set $\mathcal{C}$ such that
$\Pr(x \in \mathcal{C}) \ge \mu$ for some $\mu \in [0, 1]$. One way to measure simplicity is to use the
volume of the set. Formally, if $|\mathcal{C}|$ denotes the volume of a set, then the
quantile estimation task is to estimate

$$\operatorname{arginf}\{|\mathcal{C}| \text{ s.t. } \Pr(x \in \mathcal{C}) \ge \mu\}. \tag{7.66}$$

Given the input data $X$ one can compute the empirical density

$$\hat{p}(x) = \begin{cases} \frac{1}{m} & \text{if } x \in X \\ 0 & \text{otherwise,} \end{cases}$$

and estimate its (not necessarily unique) $\mu$-quantiles. Unfortunately, such
estimates are very brittle and do not generalize well to unseen data. One
possible way to address this issue is to restrict $\mathcal{C}$ to be simple subsets such
as spheres or half spaces. In other words, we estimate simple sets which
contain a $\mu$ fraction of the dataset. For our purposes, we specifically work
with half-spaces defined by hyperplanes. While half-spaces may seem rather
restrictive, remember that the kernel trick can be used to map data into
a high-dimensional space; half-spaces in the mapped space correspond to
non-linear decision boundaries in the input space. Furthermore, instead of
explicitly identifying $\mathcal{C}$ we will learn an indicator function for $\mathcal{C}$, that is, a
function $f$ which takes on values $+1$ inside $\mathcal{C}$ and $-1$ elsewhere.

With $||w|l? as a regularizer, the problem of estimating a hyperplane such 
that a large fraction of the points in the input data X lie on one of its sides 
can be written as: 


ie te Dee 
min 5 |w\|? + > 2 —p (7.67a) 
s.t. (w, xi) > p—& for alli (7.67b) 


Clearly, we want $\rho$ to be as large as possible so that the volume of the half-space
$\langle w, x \rangle \geq \rho$ is minimized. Furthermore, $\nu \in [0, 1]$ is a parameter which
is analogous to the $\nu$ we introduced for the $\nu$-SVM earlier. Roughly speaking,
it denotes the fraction of input data for which $\langle w, x_i \rangle < \rho$. An alternative
interpretation of (7.67) is to assume that we are separating the data set $X$
from the origin (see Figure 7.7 for an illustration). Therefore, this method
is also widely known as the one-class SVM.

Fig. 7.7. The novelty detection problem can be viewed as finding a large margin
hyperplane which separates a $\nu$ fraction of the data points away from the origin.


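The Lagrangian of (7.67) can be written by introducing non-negative
Lagrange multipliers $\alpha_i$ and $\beta_i$:

\[ L(w, \xi, \rho, \alpha, \beta) = \frac{1}{2}\|w\|^2 + \frac{1}{\nu m} \sum_{i=1}^{m} \xi_i - \rho + \sum_{i=1}^{m} \alpha_i \left( \rho - \xi_i - \langle w, x_i \rangle \right) - \sum_{i=1}^{m} \beta_i \xi_i. \]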


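By taking gradients with respect to the primal variables and setting them
to 0 we obtain

\[ w = \sum_{i=1}^{m} \alpha_i x_i \tag{7.68} \]
\[ \alpha_i = \frac{1}{\nu m} - \beta_i \leq \frac{1}{\nu m} \tag{7.69} \]
\[ \sum_{i=1}^{m} \alpha_i = 1. \tag{7.70} \]

Noting that $\alpha_i, \beta_i \geq 0$ and substituting the above conditions into the Lagrangian
yields the dual

\[ \min_{\alpha} \ \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \langle x_i, x_j \rangle \tag{7.71a} \]
\[ \text{s.t.} \ 0 \leq \alpha_i \leq \frac{1}{\nu m} \tag{7.71b} \]
\[ \sum_{i=1}^{m} \alpha_i = 1. \tag{7.71c} \]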
This can easily be solved by a straightforward modification of the SMO
algorithm (see Section 7.1.3 and Problem 7.7). As in the previous sections,
an analysis of the KKT conditions shows that $0 < \alpha_i$ if and only if $\langle w, x_i \rangle < \rho$;
such points are called support vectors. As before, we can replace $\langle x_i, x_j \rangle$ by
a kernel $k(x_i, x_j)$ to transform half-spaces in the feature space to non-linear
shapes in the input space. The following theorem explains the significance
of the parameter $\nu$.


Theorem 7.5 Assume that the solution of (7.71) satisfies $\rho \neq 0$. Then the
following statements hold:


(i) $\nu$ is an upper bound on the fraction of support vectors, that is, points
for which $\langle w, x_i \rangle < \rho$.

(ii) Suppose the data $X$ were generated independently from a distribution
$p(x)$ which does not contain discrete components. Moreover, assume
that the kernel $k$ is analytic and non-constant. With probability 1,
asymptotically, $\nu$ equals the fraction of support vectors.
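
The role of $\nu$ can be checked numerically in a simplified setting. The sketch below is only an illustration and not a solver for (7.71): it assumes that $w$ is held fixed, so that only the scores $s_i = \langle w, x_i \rangle$ matter and the regularizer is a constant, and it then minimizes the remaining part of (7.67a) over $\rho$ alone. The function name `best_rho` and the score values are made up for this example.

```python
def best_rho(scores, nu):
    """Return a minimiser over rho of (1/(nu*m)) * sum_i max(0, rho - s_i) - rho.

    The scores s_i = <w, x_i> are held fixed, so the 0.5*||w||^2 term is a
    constant and is left out.
    """
    m = len(scores)

    def objective(rho):
        slack = sum(max(0.0, rho - s) for s in scores)
        return slack / (nu * m) - rho

    # The objective is convex and piecewise linear in rho with kinks at the
    # scores, so it is enough to search over the scores themselves.
    return min(scores, key=objective)


scores = [0.3, 1.2, 0.8, 2.5, 1.9, 0.1, 1.4, 2.2, 0.9, 1.7]  # made-up <w, x_i>
nu = 0.25
rho = best_rho(scores, nu)
fraction_below = sum(s < rho for s in scores) / len(scores)
print(rho, fraction_below)
```

In this toy example the fraction of scores lying strictly below the chosen $\rho$ does not exceed $\nu$, in line with statement (i).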


7.5 Margins and Probability 


discuss the connection between probabilistic models and linear classifiers. 
issues of consistency, optimization, efficiency, etc. 


7.6 Beyond Binary Classification 


In contrast to binary classification where there are only two possible ways 
to label a training sample, in some of the extensions we discuss below each 
training sample may be associated with one or more of k possible labels. 
Therefore, we will use the decision function

\[ y^* = \operatorname*{argmax}_{y \in \{1, \ldots, k\}} f(x, y) \quad \text{where} \quad f(x, y) = \langle \phi(x, y), w \rangle. \tag{7.72} \]

Recall that the joint feature map $\phi(x, y)$ was introduced in Section 7.1.2.
One way to interpret the above equation is to view f(x, y) as a compatibility 
score between instance x and label y; we assign the label with the highest 
compatibility score to x. There are a number of extensions of the binary 
hinge loss (7.13) which can be used to estimate this score function. In all 
these cases the objective function is written as

\[ \min_{w} J(w) := \frac{\lambda}{2}\|w\|^2 + \frac{1}{m} \sum_{i=1}^{m} l(w, x_i, y_i). \tag{7.73} \]

Here $\lambda$ is a scalar which trades off the regularizer $\frac{1}{2}\|w\|^2$ with the empirical
risk $\frac{1}{m} \sum_{i=1}^{m} l(w, x_i, y_i)$. Plugging in different loss functions yields classifiers
for different settings. Two strategies exist for finding the optimal w. Just 
like in the binary SVM case, one can compute and maximize the dual of 
(7.73). However, the number of dual variables becomes m|Y|, where m is the 
number of training points and |Y| denotes the size of the label set. The second 
strategy is to optimize (7.73) directly. However, the loss functions we discuss 
below are non-smooth, therefore non-smooth optimization algorithms such 
as bundle methods (section 3.2.7) need to be used. 


7.6.1 Multiclass Classification 


In multiclass classification a training example is labeled with one of $k$ possible
labels, that is, $Y = \{1, \ldots, k\}$. We discuss two different extensions of
the binary hinge loss to the multiclass setting. It can easily be verified that
setting $Y = \{\pm 1\}$ and $\phi(x, y) = \frac{y}{2}\phi(x)$ recovers the binary hinge loss in both
cases.


7.6.1.1 Additive Multiclass Hinge Loss 


A natural generalization of the binary hinge loss is to penalize all labels
which have been misclassified. The loss can now be written as

\[ l(w, x, y) = \sum_{y' \neq y} \max\left(0, 1 - \langle \phi(x, y) - \phi(x, y'), w \rangle\right). \tag{7.74} \]


7.6.1.2 Maximum Multiclass Hinge Loss


Another variant of (7.13) penalizes only the maximally violating label:

\[ l(w, x, y) := \max\left(0, \max_{y' \neq y} \left(1 - \langle \phi(x, y) - \phi(x, y'), w \rangle\right)\right). \tag{7.75} \]

Note that both (7.74) and (7.75) are zero whenever

\[ f(x, y) = \langle \phi(x, y), w \rangle \geq 1 + \max_{y' \neq y} \langle \phi(x, y'), w \rangle = 1 + \max_{y' \neq y} f(x, y'). \tag{7.76} \]


In other words, they both ensure an adequate margin of separation, in this
case 1, between the score of the true label $f(x, y)$ and every other label
$f(x, y')$. However, they differ in the way they penalize violators, that is, labels
$y' \neq y$ for which $f(x, y) < 1 + f(x, y')$. In one case we linearly penalize
the violators and sum up their contributions while in the other case we linearly
penalize only the maximum violator. In fact, (7.75) can be interpreted
as the log odds ratio in the exponential family. Towards this end choose $\eta$
such that $\log \eta = 1$ and rewrite (7.20):

\[ \log \frac{p(y | x, w)}{\max_{y' \neq y} p(y' | x, w)} = \langle \phi(x, y), w \rangle - \max_{y' \neq y} \langle \phi(x, y'), w \rangle \geq 1. \]

Rearranging yields (7.76).
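
The difference between the two multiclass losses is easy to see on a small example. The following minimal sketch assumes that the scores $f(x, y')$ have already been computed and are passed in as a list indexed by label, with labels numbered from 0; the function names and the score values are illustrative only.

```python
def additive_hinge(scores, y):
    """Additive multiclass hinge loss (7.74); scores[k] plays the role of f(x, k)."""
    return sum(max(0.0, 1.0 - (scores[y] - s))
               for k, s in enumerate(scores) if k != y)


def maximum_hinge(scores, y):
    """Maximum multiclass hinge loss (7.75): only the worst violator is penalised."""
    worst = max(1.0 - (scores[y] - s) for k, s in enumerate(scores) if k != y)
    return max(0.0, worst)


scores = [2.0, 1.5, 0.2, 2.4]   # made-up values of f(x, y') for four labels
print(additive_hinge(scores, 0), maximum_hinge(scores, 0))

separated = [3.0, 1.0, 2.0]     # f(x, 0) >= 1 + max of the other scores
print(additive_hinge(separated, 0), maximum_hinge(separated, 0))
```

With the first score vector two labels violate the margin, so the additive loss sums both violations while the maximum loss keeps only the larger one. The second score vector satisfies (7.76) and both losses vanish.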


7.6.2 Multilabel Classification 


In multilabel classification one or more of k possible labels are assigned to 
a training example. Just like in the multiclass case two different losses can 
be defined. 


7.6.2.1 Additive Multilabel Hinge Loss 


If we let $Y_x \subseteq Y$ denote the labels assigned to $x$, and generalize the hinge
loss to penalize all labels $y' \notin Y_x$ which have been assigned higher score than
some $y \in Y_x$, then the loss can be written as

\[ l(w, x, y) = \sum_{y \in Y_x \text{ and } y' \notin Y_x} \max\left(0, 1 - \langle \phi(x, y) - \phi(x, y'), w \rangle\right). \tag{7.77} \]


7.6.2.2 Maximum Multilabel Hinge Loss 


Another variant only penalizes the maximum violating pair. In this case the
loss can be written as

\[ l(w, x, y) = \max\left(0, \max_{y \in Y_x,\, y' \notin Y_x} \left[1 - \langle \phi(x, y) - \phi(x, y'), w \rangle\right]\right). \tag{7.78} \]

One can immediately verify that specializing the above losses to the multiclass
case recovers (7.74) and (7.75) respectively, while the binary case
recovers (7.13). The above losses are zero only when

\[ \min_{y \in Y_x} f(x, y) = \min_{y \in Y_x} \langle \phi(x, y), w \rangle \geq 1 + \max_{y' \notin Y_x} \langle \phi(x, y'), w \rangle = 1 + \max_{y' \notin Y_x} f(x, y'). \]

This can be interpreted as follows: The losses ensure that all the labels
assigned to $x$ have larger scores compared to labels not assigned to $x$ with
the margin of separation of at least 1.

Although the above loss functions are compatible with multiple labels,
the prediction function $\operatorname{argmax}_y f(x, y)$ only takes into account the label
with the highest score. This is a significant drawback of such models, which
can be overcome by using a multiclass approach instead. Let $|Y|$ be the
size of the label set and $z \in \mathbb{R}^{|Y|}$ denote a vector with $\pm 1$ entries. We set
$z_y = +1$ if $y \in Y_x$ and $z_y = -1$ otherwise, and use the multiclass loss
(7.75) on $z$. To predict we compute $z^* = \operatorname{argmax}_z f(x, z)$ and assign to $x$
the labels corresponding to components of $z^*$ which are $+1$. Since $z$ can
take on $2^{|Y|}$ possible values, this approach is not feasible if $|Y|$ is large. To
tackle such problems, and to further reduce the computational complexity,
we assume that the label correlations are captured via a $|Y| \times |Y|$ positive
semi-definite matrix $P$, and $\phi(x, y)$ can be written as $\phi(x) \otimes Py$. Here $\otimes$
denotes the Kronecker product. Furthermore, we express the vector $w$ as
an $n \times |Y|$ matrix $W$, where $n$ denotes the dimension of $\phi(x)$. With these
assumptions $\langle \phi(x) \otimes P(z - z'), w \rangle$ can be rewritten as


\[ \left\langle \phi(x)^\top W P, (z - z') \right\rangle = \sum_{i} \left[\phi(x)^\top W P\right]_i (z_i - z'_i), \]


and (7.78) specializes to 


\[ l(w, x, z) := \max\left(0, 1 - \min_{i:\, z'_i \neq z_i} \left[\phi(x)^\top W P\right]_i (z_i - z'_i)\right). \tag{7.79} \]

An analogous specialization of (7.77) can also be derived wherein the minimum
is replaced by a summation. Since the minimum (or summation as the
case may be) is over $|Y|$ possible labels, computing the loss is tractable even
if the set of labels $Y$ is large.


7.6.3 Ordinal Regression and Ranking 


We can generalize our above discussion to consider slightly more general
ranking problems. Denote by $Y$ the set of all directed acyclic graphs on $N$
nodes. The presence of an edge $(i, j)$ in $y \in Y$ indicates that $i$ is preferred
to $j$. The goal is to find a function $f(x, i)$ which imposes a total order on
$\{1, \ldots, N\}$ which is in close agreement with $y$. Specifically, if the estimation
error is given by the number of subgraphs of $y$ which are in disagreement
with the total order imposed by $f$, then the additive version of the loss can
be written as

\[ l(w, x, y) = \sum_{G \in A(y)} \max_{(i, j) \in G} \max\left(0, 1 - (f(x, i) - f(x, j))\right), \tag{7.80} \]

where $A(y)$ denotes the set of all possible subgraphs of $y$. The maximum
margin version, on the other hand, is given by

\[ l(w, x, y) = \max_{G \in A(y)} \max_{(i, j) \in G} \max\left(0, 1 - (f(x, i) - f(x, j))\right). \tag{7.81} \]

In other words, we test for each subgraph $G$ of $y$ if the ranking imposed by $G$
is satisfied by $f$. Selecting specific types of directed acyclic graphs recovers
the multiclass and multilabel settings (Problem 7.9).


7.7 Large Margin Classifiers with Structure 
7.7.1 Margin 


define margin pictures 


7.7.2 Penalized Margin 


different types of loss, rescaling 


7.7.3 Nonconvex Losses


the max - max loss 


7.8 Applications 

7.8.1 Sequence Annotation 
7.8.2 Matching 

7.8.3 Ranking 

7.8.4 Shortest Path Planning 
7.8.5 Image Annotation 
7.8.6 Contingency Table Loss 
7.9 Optimization 

7.9.1 Column Generation 


subdifferentials 


7.9.2 Bundle Methods 
7.9.3 Overrelaxation in the Dual


when we cannot do things exactly 
7.10 CRFs vs Structured Large Margin Models
7.10.1 Loss Function 

7.10.2 Dual Connections 

7.10.3 Optimization 

Problems 


Problem 7.1 (Deriving the Margin {1}) Show that the distance of a
point $x_i$ to a hyperplane $H = \{x \mid \langle w, x \rangle + b = 0\}$ is given by
$|\langle w, x_i \rangle + b| / \|w\|$.

Problem 7.2 (SVM without Bias {1}) A homogeneous hyperplane is one
which passes through the origin, that is,

\[ H := \{x \mid \langle w, x \rangle = 0\}. \tag{7.82} \]

If we devise a soft margin classifier which uses the homogeneous hyperplane
as a decision boundary, then the corresponding primal optimization problem
can be written as follows:

\[ \min_{w, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \tag{7.83a} \]
\[ \text{s.t.} \ y_i \langle w, x_i \rangle \geq 1 - \xi_i \ \text{for all } i \tag{7.83b} \]
\[ \xi_i \geq 0. \tag{7.83c} \]

Derive the dual of (7.83) and contrast it with (7.9). What changes to the
SMO algorithm would you make to solve this dual?


Problem 7.3 (Deriving the simplified ν-SVM dual {2}) In Lemma 7.2
we used (7.41) to show that the constraint $\sum_i \alpha_i \geq 1$ can be replaced by
$\sum_i \alpha_i = 1$. Show that an equivalent way to arrive at the same conclusion is
by arguing that the constraint $\rho \geq 0$ is redundant in the primal (7.40). Hint:
Observe that whenever $\rho < 0$ the objective function is always non-negative.
On the other hand, setting $w = \xi = b = \rho = 0$ yields an objective function
value of 0.


Problem 7.4 (Fenchel and Lagrange Duals {2}) We derived the Lagrange
dual of (7.12) in Section 7.1 and showed that it is (7.9). Derive the
Fenchel dual of (7.12) and relate it to (7.9). Hint: See theorem 3.3.5 of
[BL00].
Problem 7.5 (Dual of the square hinge loss {1}) The analog of (7.5)
when working with the square hinge loss is the following:

\[ \min_{w, b, \xi} \ \frac{1}{2}\|w\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i^2 \tag{7.84a} \]
\[ \text{s.t.} \ y_i (\langle w, x_i \rangle + b) \geq 1 - \xi_i \ \text{for all } i \tag{7.84b} \]
\[ \xi_i \geq 0. \tag{7.84c} \]

Derive the Lagrange dual of the above optimization problem and show that
it is a Quadratic Programming problem.


Problem 7.6 (Dual of the ramp loss {1}) Derive the Lagrange dual of
(7.49) and show that it is the Quadratic Programming problem (7.50).


Problem 7.7 (SMO for various SVM formulations {2}) Derive an SMO 
like decomposition algorithm for solving the dual of the following problems: 


• ν-SVM (7.41).
• SV regression (7.57).
• SV novelty detection (7.71).


Problem 7.8 (Novelty detection with Balls {2}) In Section 7.4 we assumed
that we wanted to estimate a halfspace which contains a major fraction
of the input data. An alternative approach is to use balls, that is, we
estimate a ball of small radius in feature space which encloses a majority of
the input data. Write the corresponding optimization problem and its dual.
Show that if the kernel is translation invariant, that is, $k(x, x')$ depends only
on $\|x - x'\|$, then the optimization problem with balls is equivalent to (7.71).
Explain why this happens geometrically.


Problem 7.9 (Multiclass and Multilabel loss from Ranking Loss {1}) 
Show how the multiclass (resp. multilabel) losses (7.74) and (7.75) (resp. 
(7.77) and (7.79)) can be derived as special cases of (7.80) and (7.81) re- 
spectively. 


Problem 7.10 Invariances (basic loss) 


Problem 7.11 Polynomial transformations - SDP constraints 


Appendix 1 


Linear Algebra and Functional Analysis 


A1.1 Johnson Lindenstrauss Lemma 


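Lemma 1.1 (Johnson Lindenstrauss) Let $X$ be a set of $n$ points in $\mathbb{R}^d$
represented as an $n \times d$ matrix $A$. Given $\epsilon, \beta > 0$ let

\[ k \geq \frac{4 + 2\beta}{\epsilon^2/2 - \epsilon^3/3} \log n \tag{1.1} \]

be a positive integer. Construct a $d \times k$ random matrix $R$ with independent
standard normal random variables, that is, $R_{ij} \sim N(0, 1)$, and let

\[ E = \frac{1}{\sqrt{k}} A R. \tag{1.2} \]

Define $f : \mathbb{R}^d \to \mathbb{R}^k$ as the function which maps the rows of $A$ to the rows
of $E$. With probability at least $1 - n^{-\beta}$, for all $u, v \in X$ we have

\[ (1 - \epsilon) \|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1 + \epsilon) \|u - v\|^2. \tag{1.3} \]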
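
The construction in the lemma can be tried out directly. The sketch below uses only the Python standard library; the helper names (`jl_dimension`, `project`, `sq_dist`), the seed, and the values of $n$, $d$, $\epsilon$ and $\beta$ are assumptions made for illustration. It takes $k$ to be the smallest integer satisfying $k \geq (4 + 2\beta) \log n / (\epsilon^2/2 - \epsilon^3/3)$, draws a Gaussian matrix $R$, and reports the range of the ratios $\|f(u) - f(v)\|^2 / \|u - v\|^2$ over all pairs.

```python
import math
import random


def jl_dimension(n, eps, beta):
    """Smallest integer k with k >= (4 + 2*beta) / (eps^2/2 - eps^3/3) * log(n)."""
    return math.ceil((4 + 2 * beta) / (eps ** 2 / 2 - eps ** 3 / 3) * math.log(n))


def project(points, k, rng):
    """Map each row a of `points` to (1/sqrt(k)) * a R, where R has N(0, 1) entries."""
    d = len(points[0])
    R = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
    scale = 1.0 / math.sqrt(k)
    return [[scale * sum(a[j] * R[j][i] for j in range(d)) for i in range(k)]
            for a in points]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


rng = random.Random(0)                 # fixed seed, purely for reproducibility
n, d, eps, beta = 10, 50, 0.5, 2.0     # illustrative values
k = jl_dimension(n, eps, beta)
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
E = project(X, k, rng)

ratios = [sq_dist(E[i], E[j]) / sq_dist(X[i], X[j])
          for i in range(n) for j in range(i + 1, n)]
print(k, min(ratios), max(ratios))
```

Note that the required $k$ depends only on $n$, $\epsilon$ and $\beta$ and not on $d$. For a data set as small as this one, $k$ exceeds $d$, so the example illustrates the distortion guarantee rather than an actual reduction in dimension.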


Our proof presentation by and large follows [?]. We first show that 


Lemma 1.2 For any arbitrary vector $\alpha \in \mathbb{R}^d$ let $q_i$ denote the $i$-th component
of $f(\alpha)$. Then $q_i \sim N(0, \|\alpha\|^2 / k)$ and hence

\[ \mathbb{E}\left[\|f(\alpha)\|^2\right] = \sum_{i=1}^{k} \mathbb{E}\left[q_i^2\right] = \|\alpha\|^2. \tag{1.4} \]

In other words, the expected length of vectors is preserved even after embedding
them in a $k$ dimensional space. Next we show that the lengths of
the embedded vectors are tightly concentrated around their mean.

Lemma 1.3 For any $\epsilon > 0$ and any unit vector $\alpha \in \mathbb{R}^d$ we have

\[ \Pr\left(\|f(\alpha)\|^2 > 1 + \epsilon\right) < \exp\left(-\frac{k}{2}\left(\epsilon^2/2 - \epsilon^3/3\right)\right) \tag{1.5} \]
\[ \Pr\left(\|f(\alpha)\|^2 < 1 - \epsilon\right) < \exp\left(-\frac{k}{2}\left(\epsilon^2/2 - \epsilon^3/3\right)\right). \tag{1.6} \]

Corollary 1.4 If we choose $k$ as in (1.1) then for any $\alpha \in \mathbb{R}^d$ we have

\[ \Pr\left((1 - \epsilon)\|\alpha\|^2 \leq \|f(\alpha)\|^2 \leq (1 + \epsilon)\|\alpha\|^2\right) \geq 1 - \frac{2}{n^{2+\beta}}. \tag{1.7} \]

Proof Follows immediately from Lemma 1.3 by setting

\[ 2 \exp\left(-\frac{k}{2}\left(\epsilon^2/2 - \epsilon^3/3\right)\right) \leq \frac{2}{n^{2+\beta}} \]

and solving for $k$. ∎

There are $\binom{n}{2}$ pairs of vectors $u, v$ in $X$, and their corresponding distances
$\|u - v\|$ are preserved within a $1 \pm \epsilon$ factor as shown by the above corollary.
Therefore, the probability of not satisfying (1.3) is bounded by
$\binom{n}{2} \cdot \frac{2}{n^{2+\beta}} < 1/n^{\beta}$ as claimed in the Johnson Lindenstrauss Lemma. All that remains is
to prove Lemma 1.2 and 1.3.

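Proof (Lemma 1.2). Since $q_i = \frac{1}{\sqrt{k}} \sum_j R_{ij} \alpha_j$ is a linear combination of standard
normal random variables $R_{ij}$ it follows that $q_i$ is normally distributed.
To compute the mean note that

\[ \mathbb{E}[q_i] = \frac{1}{\sqrt{k}} \sum_j \alpha_j \mathbb{E}[R_{ij}] = 0. \]

Since $R_{ij}$ are independent zero mean unit variance random variables, $\mathbb{E}[R_{ij} R_{il}] = 1$
if $j = l$ and 0 otherwise. Using this

\[ \mathbb{E}[q_i^2] = \frac{1}{k} \mathbb{E}\left[\left(\sum_{j=1}^{d} R_{ij} \alpha_j\right)^2\right] = \frac{1}{k} \sum_{j=1}^{d} \sum_{l=1}^{d} \alpha_j \alpha_l \mathbb{E}[R_{ij} R_{il}] = \frac{1}{k} \sum_{j=1}^{d} \alpha_j^2 = \frac{1}{k} \|\alpha\|^2. \]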


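Proof (Lemma 1.3). Clearly, for all $\lambda > 0$

\[ \Pr\left[\|f(\alpha)\|^2 > 1 + \epsilon\right] = \Pr\left[\exp\left(\lambda \|f(\alpha)\|^2\right) > \exp(\lambda(1 + \epsilon))\right]. \]

Using Markov's inequality ($\Pr[X \geq a] \leq \mathbb{E}[X]/a$) we obtain

\[ \Pr\left[\exp\left(\lambda \|f(\alpha)\|^2\right) > \exp(\lambda(1 + \epsilon))\right] \leq \frac{\mathbb{E}\left[\exp\left(\lambda \|f(\alpha)\|^2\right)\right]}{\exp(\lambda(1 + \epsilon))} \]
\[ = \frac{\mathbb{E}\left[\prod_{i=1}^{k} \exp\left(\lambda q_i^2\right)\right]}{\exp(\lambda(1 + \epsilon))} \]
\[ = \left(\frac{\mathbb{E}\left[\exp\left(\lambda q_i^2\right)\right]}{\exp\left(\frac{\lambda}{k}(1 + \epsilon)\right)}\right)^k. \tag{1.8} \]

The last equality is because the $q_i$'s are i.i.d. Since $\alpha$ is a unit vector, from
the previous lemma $q_i \sim N(0, 1/k)$. Therefore, $k q_i^2$ is a $\chi^2$ random variable
with moment generating function

\[ \mathbb{E}\left[\exp\left(\lambda q_i^2\right)\right] = \mathbb{E}\left[\exp\left(\frac{\lambda}{k} k q_i^2\right)\right] = \frac{1}{\sqrt{1 - \frac{2\lambda}{k}}}. \]

Plugging this into (1.8)

\[ \Pr\left[\exp\left(\lambda \|f(\alpha)\|^2\right) > \exp(\lambda(1 + \epsilon))\right] \leq \left(\frac{\exp\left(-\frac{\lambda}{k}(1 + \epsilon)\right)}{\sqrt{1 - \frac{2\lambda}{k}}}\right)^k. \]

Setting $\lambda = \frac{k \epsilon}{2(1 + \epsilon)}$ in the above inequality and simplifying

\[ \Pr\left[\exp\left(\lambda \|f(\alpha)\|^2\right) > \exp(\lambda(1 + \epsilon))\right] \leq \left(\exp(-\epsilon)(1 + \epsilon)\right)^{k/2}. \]

Using the inequality

\[ \log(1 + \epsilon) < \epsilon - \epsilon^2/2 + \epsilon^3/3 \]

we can write

\[ \Pr\left[\exp\left(\lambda \|f(\alpha)\|^2\right) > \exp(\lambda(1 + \epsilon))\right] \leq \exp\left(-\frac{k}{2}\left(\epsilon^2/2 - \epsilon^3/3\right)\right). \]

This proves (1.5). To prove (1.6) we need to repeat the above steps and use
the inequality

\[ \log(1 - \epsilon) < -\epsilon - \epsilon^2/2. \]

This is left as an exercise to the reader. ∎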
A1.2 Spectral Properties of Matrices
A1.2.1 Basics
A1.2.2 Special Matrices


unitary, hermitean, positive semidefinite


A1.2.3 Normal Forms


Jacobi 


A1.3 Functional Analysis 
A1.3.1 Norms and Metrics 


vector space, norm, triangle inequality 


A1.3.2 Banach Spaces 


normed vector space, evaluation functionals, examples, dual space 


A1.3.3 Hilbert Spaces 


symmetric inner product 


A1.3.4 Operators 


spectrum, norm, bounded, unbounded operators 


A1.4 Fourier Analysis 
A1.4.1 Basics 
A1.4.2 Operators 
Conjugate Distributions


Binomial - Beta

\[ \phi(x) = x \]
\[ e^{h(n\nu, n)} = \frac{\Gamma(n\nu + 1)\, \Gamma(n(1 - \nu) + 1)}{\Gamma(n + 2)} \]

In traditional notation one represents the conjugate as

\[ p(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} \]

where $\alpha = n\nu + 1$ and $\beta = n(1 - \nu) + 1$.
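
As a sanity check on the Beta normalization, the closed form $\Gamma(n\nu + 1)\Gamma(n(1 - \nu) + 1)/\Gamma(n + 2)$ can be compared with a numerical integral of $x^{n\nu}(1 - x)^{n(1 - \nu)}$ over $[0, 1]$, which is the unnormalized Beta density with $\alpha = n\nu + 1$ and $\beta = n(1 - \nu) + 1$. The function names, the midpoint rule and the values of $n$ and $\nu$ below are choices made for this illustration.

```python
import math


def beta_normaliser(n, nu):
    """Closed form Gamma(n*nu + 1) * Gamma(n*(1 - nu) + 1) / Gamma(n + 2)."""
    return math.gamma(n * nu + 1) * math.gamma(n * (1 - nu) + 1) / math.gamma(n + 2)


def beta_normaliser_numeric(n, nu, steps=100000):
    """Midpoint-rule estimate of the integral of x^(n*nu) * (1 - x)^(n*(1 - nu)) on [0, 1]."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** (n * nu) * (1 - (i + 0.5) * h) ** (n * (1 - nu))
                   for i in range(steps))


n, nu = 4, 0.25
print(beta_normaliser(n, nu), beta_normaliser_numeric(n, nu))
```

For $n = 4$ and $\nu = 0.25$ the closed form reduces to $\Gamma(2)\Gamma(4)/\Gamma(6) = 1/20$, and the numerical estimate agrees with it up to discretization error.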
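Multinomial - Dirichlet

\[ \phi(x) = e_x \]
\[ e^{h(n\nu, n)} = \frac{\prod_{i=1}^{d} \Gamma(n\nu_i + 1)}{\Gamma(n + d)} \]

In traditional notation one represents the conjugate as

\[ p(x; \alpha) = \frac{\Gamma\left(\sum_{i=1}^{d} \alpha_i\right)}{\prod_{i=1}^{d} \Gamma(\alpha_i)} \prod_{i=1}^{d} x_i^{\alpha_i - 1} \]

where $\alpha_i = n\nu_i + 1$.

Poisson - Gamma

\[ \phi(x) = x \]
\[ e^{h(n\nu, n)} = n^{-n\nu}\, \Gamma(n\nu) \]

In traditional notation one represents the conjugate as

\[ p(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x} \]

where $\alpha = n\nu$ and $\beta = n$.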


Multinomial / Binomial 
Gaussian 

Laplace 

Poisson 

Dirichlet 

Wishart 

Student-t 

Beta 

Gamma 
Loss Functions


A3.1 Loss Functions 


A multitude of loss functions are commonly used to derive seemingly differ- 
ent algorithms. This often blurs the similarities as well as subtle differences 
between them, often for historic reasons: Each new loss is typically accompa- 
nied by at least one publication dedicated to it. In many cases, the loss is not 
spelled out explicitly either but instead, it is only given by means of a con- 
strained optimization problem. A case in point are the papers introducing 
(binary) hinge loss [BM92, CV95] and structured loss [TGK04, TJHA05]. 
Likewise, a geometric description obscures the underlying loss function, as 
in novelty detection [SPST+01].

In this section we give an expository yet unifying presentation of many 
of those loss functions. Many of them are well known, while others, such 
as multivariate ranking, hazard regression, or Poisson regression are not 
commonly used in machine learning. Tables A3.1 and A3.1 contain a choice 
subset of simple scalar and vectorial losses. Our aim is to put the multitude
of loss functions in a unified framework, and to show how these losses
and their (sub)gradients can be computed efficiently for use in our solver 
framework. 

Note that not all losses, while convex, are continuously differentiable. In 
this situation we give a subgradient. While this may not be optimal, the 
convergence rates of our algorithm do not depend on which element of the 
subdifferential we provide: in all cases the first order Taylor approximation 
is a lower bound which is tight at the point of expansion. 

In this section, with a slight abuse of notation, $v_i$ is understood as the $i$-th
component of vector v when v is clearly not an element of a sequence or a 
set. 


A3.1.1 Scalar Loss Functions 


It is well known [Wah97] that the convex optimization problem 


\[ \min_{\xi} \ \xi \quad \text{subject to} \quad y \langle w, x \rangle \geq 1 - \xi \ \text{ and } \ \xi \geq 0 \tag{3.1} \]
Loss                            l(f, y)                                              Derivative l'(f, y)
Logistic [CSS00]                log(1 + exp(-yf))                                    -y/(1 + exp(yf))
Novelty [SPST+01]               max(0, ρ - f)                                        0 if f ≥ ρ and -1 otherwise
Least mean squares [Wil98]      (1/2)(f - y)^2                                       f - y
Least absolute deviation        |f - y|                                              sign(f - y)
Quantile regression [Koe05]     max(τ(f - y), (1 - τ)(y - f))                        τ if f > y and τ - 1 otherwise
ε-insensitive [VGS97]           max(0, |f - y| - ε)                                  0 if |f - y| ≤ ε, else sign(f - y)
Huber's robust loss [MSR+97]    (1/2)(f - y)^2 if |f - y| ≤ 1, else |f - y| - 1/2    f - y if |f - y| ≤ 1, else sign(f - y)
Poisson regression [Cre93]      exp(f) - yf                                          exp(f) - y


Vectorial loss functions and their derivatives, depending on the vector f := Wx and on y.

Loss                                     l(f, y)                                    Derivative
Soft-Margin Multiclass [TGK04, CS03]     max_{y'} (f_{y'} - f_y + Δ(y, y'))         e_{y*} - e_y, where y* is the argmax of the loss
Scaled Soft-Margin Multiclass [TJHA05]   max_{y'} Γ(y, y')(f_{y'} - f_y + Δ(y, y'))  Γ(y, y*)(e_{y*} - e_y), where y* is the argmax of the loss
Softmax Multiclass [CDLS99]              log Σ_{y'} exp(f_{y'}) - f_y               [Σ_{y'} e_{y'} exp(f_{y'})] / Σ_{y'} exp(f_{y'}) - e_y
Multivariate Regression                  (1/2)(f - y)^T M (f - y) where M ⪰ 0       M(f - y)
takes on the value $\max(0, 1 - y \langle w, x \rangle)$. The latter is a convex function in
$w$ and $x$. Likewise, we may rewrite the $\epsilon$-insensitive loss, Huber's robust
loss, the quantile regression loss, and the novelty detection loss in terms of
loss functions rather than a constrained optimization problem. In all cases,
$\langle w, x \rangle$ will play a key role insofar as the loss is convex in terms of the scalar
quantity $\langle w, x \rangle$. A large number of loss functions fall into this category,
as described in Table A3.1. Note that not all functions of this type are
continuously differentiable. In this case we adopt the convention that

\[ \partial_x \max(f(x), g(x)) = \begin{cases} \partial_x f(x) & \text{if } f(x) \geq g(x) \\ \partial_x g(x) & \text{otherwise.} \end{cases} \tag{3.2} \]

Since we are only interested in obtaining an arbitrary element of the subdifferential
this convention is consistent with our requirements.
Let us discuss the issue of efficient computation. For all scalar losses we
may write $l(x, y, w) = l(\langle w, x \rangle, y)$, as described in Table A3.1. In this case a
simple application of the chain rule yields that $\partial_w l(x, y, w) = l'(\langle w, x \rangle, y) \cdot x$.
For instance, for squared loss we have

\[ l(\langle w, x \rangle, y) = \frac{1}{2}(\langle w, x \rangle - y)^2 \quad \text{and} \quad l'(\langle w, x \rangle, y) = \langle w, x \rangle - y. \]

Consequently, the derivative of the empirical risk term is given by

\[ \partial_w R_{\mathrm{emp}}(w) = \frac{1}{m} \sum_{i=1}^{m} l'(\langle w, x_i \rangle, y_i) \cdot x_i. \tag{3.3} \]

This means that if we want to compute $l$ and $\partial_w l$ on a large number of
observations $x_i$, represented as matrix $X$, we can make use of fast linear
algebra routines to pre-compute the vectors

\[ f = Xw \quad \text{and} \quad g^\top X \quad \text{where} \quad g_i = l'(f_i, y_i). \tag{3.4} \]

This is possible for any of the loss functions listed in Table A3.1, and many
other similar losses. The advantage of this unified representation is that
implementation of each individual loss can be done in very little time. The
computational infrastructure for computing $Xw$ and $g^\top X$ is shared. Evaluating
$l(f_i, y_i)$ and $l'(f_i, y_i)$ for all $i$ can be done in $O(m)$ time and it is
not time-critical in comparison to the remaining operations. Algorithm 3.1
describes the details.

An important but often neglected issue is worth mentioning. Computing $f$
requires us to right multiply the matrix $X$ with the vector $w$ while computing
$g$ requires the left multiplication of $X$ with the vector $g^\top$. If $X$ is stored in a
row major format then $Xw$ can be computed rather efficiently while $g^\top X$ is
expensive. This is particularly true if $X$ cannot fit in main memory. The converse
holds when $X$ is stored in column major format. Similar problems are
encountered when $X$ is a sparse matrix and stored in either compressed row
format or in compressed column format.

Algorithm 3.1 ScalarLoss(w, X, y)
1: input: Weight vector $w$, feature matrix $X$, and labels $y$
2: Compute $f = Xw$
3: Compute $r = \sum_i l(f_i, y_i)$ and $g = l'(f, y)$
4: $g \leftarrow g^\top X$
5: return Risk $r$ and gradient $g$
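
The shared structure of Algorithm 3.1 can be sketched in a few lines. The version below is a minimal illustration in plain Python lists, not the solver's actual implementation: each loss is assumed to be a function returning the pair $(l(f, y), l'(f, y))$, and the names `squared`, `logistic` and `scalar_loss` as well as the data are made up for the example. A real implementation would use a linear algebra library for the two matrix-vector products.

```python
import math


def squared(f, y):
    """Least mean squares: loss and derivative with respect to f."""
    return 0.5 * (f - y) ** 2, f - y


def logistic(f, y):
    """Logistic loss log(1 + exp(-y f)) and its derivative -y / (1 + exp(y f))."""
    return math.log1p(math.exp(-y * f)), -y / (1.0 + math.exp(y * f))


def scalar_loss(w, X, y, loss):
    """Follow Algorithm 3.1: f = Xw, then r = sum_i l(f_i, y_i) and g = l'(f, y)^T X."""
    f = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
    values, derivs = zip(*(loss(fi, yi) for fi, yi in zip(f, y)))
    risk = sum(values)
    grad = [sum(d * row[j] for d, row in zip(derivs, X)) for j in range(len(w))]
    return risk, grad


X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]   # made-up feature matrix
y = [1.0, -1.0, 1.0]
w = [0.2, -0.4]
risk, grad = scalar_loss(w, X, y, logistic)
print(risk, grad)
```

Adding a new scalar loss then only requires one more small function returning the loss value and its derivative; the products with $X$ are untouched.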


A3.1.2 Structured Loss 


In recent years structured estimation has gained substantial popularity in
machine learning [TJHA05, TGK04, BHS+07]. At its core it relies on two
types of convex loss functions: logistic loss:

\[ l(x, y, w) = \log \sum_{y' \in \mathcal{Y}} \exp\left(\langle w, \phi(x, y') \rangle\right) - \langle w, \phi(x, y) \rangle, \tag{3.5} \]

and soft-margin loss:

\[ l(x, y, w) = \max_{y' \in \mathcal{Y}} \Gamma(y, y') \langle w, \phi(x, y') - \phi(x, y) \rangle + \Delta(y, y'). \tag{3.6} \]

Here $\phi(x, y)$ is a joint feature map, $\Delta(y, y') \geq 0$ describes the cost of misclassifying
$y$ by $y'$, and $\Gamma(y, y') \geq 0$ is a scaling term which indicates by how
much the large margin property should be enforced. For instance, [TGK04]
choose $\Gamma(y, y') = 1$. On the other hand [TJHA05] suggest $\Gamma(y, y') = \Delta(y, y')$,
which reportedly yields better performance. Finally, [McA07] recently suggested
generic functions $\Gamma(y, y')$.

The logistic loss can also be interpreted as the negative log-likelihood of
a conditional exponential family model:

\[ p(y | x; w) = \exp\left(\langle w, \phi(x, y) \rangle - g(w | x)\right), \tag{3.7} \]

where the normalizing constant $g(w | x)$, often called the log-partition function,
reads

\[ g(w | x) := \log \sum_{y' \in \mathcal{Y}} \exp\left(\langle w, \phi(x, y') \rangle\right). \tag{3.8} \]
As a consequence of the Hammersley-Clifford theorem [Jor08] every exponential
family distribution corresponds to an undirected graphical model. In
our case this implies that the labels $y$ factorize according to an undirected
graphical model. A large number of problems have been addressed by this
setting, amongst them named entity tagging [LMP01], sequence alignment
[TJHA05], segmentation [RSS+07] and path planning [RBZ06]. It is clearly
impossible to give examples of all settings in this section, nor would a brief
summary do this field any justice. We therefore refer the reader to the edited
volume [BHS+07] and the references therein.

If the underlying graphical model is tractable then efficient inference al- 
gorithms based on dynamic programming can be used to compute (3.5) and 
(3.6). We discuss intractable graphical models in Section A3.1.2.1, and now 
turn our attention to the derivatives of the above structured losses. 

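When it comes to computing derivatives of the logistic loss, (3.5), we have

\[ \partial_w l(x, y, w) = \frac{\sum_{y'} \phi(x, y') \exp \langle w, \phi(x, y') \rangle}{\sum_{y'} \exp \langle w, \phi(x, y') \rangle} - \phi(x, y) \tag{3.9} \]
\[ = \mathbb{E}_{y' \sim p(y' | x)}\left[\phi(x, y')\right] - \phi(x, y), \tag{3.10} \]

where $p(y | x)$ is the exponential family model (3.7). In the case of (3.6) we
denote by $\bar{y}(x)$ the argmax of the RHS, that is

\[ \bar{y}(x) := \operatorname*{argmax}_{y'} \Gamma(y, y') \langle w, \phi(x, y') - \phi(x, y) \rangle + \Delta(y, y'). \tag{3.11} \]

This allows us to compute the derivative of $l(x, y, w)$ as

\[ \partial_w l(x, y, w) = \Gamma(y, \bar{y}(x)) \left[\phi(x, \bar{y}(x)) - \phi(x, y)\right]. \tag{3.12} \]

In the case where the loss is maximized for more than one distinct value $\bar{y}(x)$
we may average over the individual values, since any convex combination of
such terms lies in the subdifferential.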
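
For a label set small enough to enumerate, both structured losses and their (sub)gradients can be computed directly. The sketch below is illustrative only: it assumes the joint features $\phi(x, y')$ are given explicitly as one vector per label, evaluates the logistic gradient as the expected feature vector under the model minus the observed one, and evaluates a soft-margin subgradient at the maximizing label. All function names and numbers are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logistic_loss_grad(w, feats, y):
    """Logistic loss (3.5) and its gradient (3.10); feats[k] stands for phi(x, k)."""
    scores = [dot(w, phi) for phi in feats]
    top = max(scores)                                  # subtract the max for stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    probs = [math.exp(s - log_z) for s in scores]      # p(y'|x; w) from (3.7)
    expected = [sum(p * phi[j] for p, phi in zip(probs, feats))
                for j in range(len(w))]
    return log_z - scores[y], [e - t for e, t in zip(expected, feats[y])]


def soft_margin_loss_subgrad(w, feats, y, delta, gamma):
    """Soft-margin loss (3.6) and a subgradient taken at a maximising label."""
    def value(k):
        diff = [a - b for a, b in zip(feats[k], feats[y])]
        return gamma(y, k) * dot(w, diff) + delta(y, k)

    ybar = max(range(len(feats)), key=value)           # the argmax in (3.11)
    grad = [gamma(y, ybar) * (a - b) for a, b in zip(feats[ybar], feats[y])]
    return value(ybar), grad


def zero_one(y, k):
    """0-1 cost used as Delta(y, y') in this example."""
    return 0.0 if y == k else 1.0


def const(y, k):
    """Constant scaling Gamma(y, y') = 1."""
    return 1.0


feats = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.3, 0.3, 1.0]]   # made-up phi(x, k)
w = [0.4, -0.2, 0.1]
print(logistic_loss_grad(w, feats, 0))
print(soft_margin_loss_subgrad(w, feats, 0, zero_one, const))
```

For structured label spaces the explicit enumeration over labels would be replaced by dynamic programming, as discussed above.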

Note that (3.6) majorizes $\Delta(y, y^*)$, where $y^* := \operatorname{argmax}_{y'} \langle w, \phi(x, y') \rangle$
[TJHA05]. This can be seen via the following series of inequalities:

\[ \Delta(y, y^*) \leq \Gamma(y, y^*) \langle w, \phi(x, y^*) - \phi(x, y) \rangle + \Delta(y, y^*) \leq l(x, y, w). \]

The first inequality follows because $\Gamma(y, y^*) \geq 0$ and $y^*$ maximizes $\langle w, \phi(x, y') \rangle$,
thus implying that $\Gamma(y, y^*) \langle w, \phi(x, y^*) - \phi(x, y) \rangle \geq 0$. The second inequality
follows by definition of the loss.

We conclude this section with a simple lemma which is at the heart of 
several derivations of [Joa05]. While the proof in the original paper is far 
from trivial, it is straightforward in our setting: 
Lemma 3.1 Denote by $\delta(y, y')$ a loss and let $\phi(x_i, y_i)$ be a feature map for
observations $(x_i, y_i)$ with $1 \leq i \leq m$. Moreover, denote by $X, Y$ the set of
all $m$ patterns and labels respectively. Finally let

\[ \Phi(X, Y) := \sum_{i=1}^{m} \phi(x_i, y_i) \quad \text{and} \quad \Delta(Y, Y') := \sum_{i=1}^{m} \delta(y_i, y'_i). \tag{3.13} \]

Then the following two losses are equivalent:

\[ \sum_{i=1}^{m} \max_{y'} \langle w, \phi(x_i, y') - \phi(x_i, y_i) \rangle + \delta(y_i, y') \quad \text{and} \quad \max_{Y'} \langle w, \Phi(X, Y') - \Phi(X, Y) \rangle + \Delta(Y, Y'). \]

This is immediately obvious, since both feature map and loss decompose,
which allows us to perform maximization over $Y'$ by maximizing each of its
$m$ components. In doing so, we showed that aggregating all data and labels
into a single feature map and loss yields results identical to minimizing
the sum over all individual losses. This holds, in particular, for the sample
error loss of [Joa05]. Also note that this equivalence does not hold whenever
$\Gamma(y, y')$ is not constant.
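
The equivalence stated in the lemma can be confirmed by brute force on a tiny problem. The sketch below makes several assumptions for the sake of illustration: a hypothetical block-structured joint feature map `phi`, a 0-1 loss as $\delta$, three patterns and three labels, so that all $3^3$ joint labelings $Y'$ can be enumerated.

```python
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x, y, num_labels):
    """Hypothetical joint feature map: x copied into the block belonging to label y."""
    out = [0.0] * (len(x) * num_labels)
    out[y * len(x):(y + 1) * len(x)] = x
    return out


def sum_of_losses(w, xs, ys, num_labels, delta):
    """Sum over i of max_{y'} <w, phi(x_i, y') - phi(x_i, y_i)> + delta(y_i, y')."""
    return sum(max(dot(w, phi(x, k, num_labels)) - dot(w, phi(x, y, num_labels))
                   + delta(y, k) for k in range(num_labels))
               for x, y in zip(xs, ys))


def aggregated_loss(w, xs, ys, num_labels, delta):
    """max over joint labelings Y' of <w, Phi(X, Y') - Phi(X, Y)> + Delta(Y, Y')."""
    def big_phi(labels):
        vecs = [phi(x, k, num_labels) for x, k in zip(xs, labels)]
        return [sum(col) for col in zip(*vecs)]

    base = dot(w, big_phi(ys))
    return max(dot(w, big_phi(cand)) - base
               + sum(delta(y, k) for y, k in zip(ys, cand))
               for cand in product(range(num_labels), repeat=len(xs)))


def zero_one(y, k):
    """0-1 loss used as delta(y, y')."""
    return 0.0 if y == k else 1.0


xs = [[1.0, -0.5], [0.2, 0.8], [-1.0, 0.3]]      # made-up patterns
ys = [0, 2, 1]
w = [0.5, -0.1, 0.3, 0.7, -0.4, 0.2]
print(sum_of_losses(w, xs, ys, 3, zero_one), aggregated_loss(w, xs, ys, 3, zero_one))
```

Both quantities agree up to floating point rounding, because the maximization over $Y'$ splits into $m$ independent maximizations.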


A3.1.2.1 Intractable Models


We now discuss cases where computing $l(x, y, w)$ itself is too expensive. For
instance, for intractable graphical models, the sum $\sum_y \exp \langle w, \phi(x, y) \rangle$
cannot be computed efficiently. [WJ03] propose the use of a convex majorization
of the log-partition function in those cases. In our setting this means
that instead of dealing with

\[ l(x, y, w) = g(w | x) - \langle w, \phi(x, y) \rangle \quad \text{where} \quad g(w | x) := \log \sum_{y} \exp \langle w, \phi(x, y) \rangle \tag{3.14} \]

one uses a more easily computable convex upper bound on $g$ via

\[ \sup_{\mu \in \mathrm{MARG}(x)} \langle w, \mu \rangle + H_{\mathrm{Gauss}}(\mu | x). \tag{3.15} \]

Here $\mathrm{MARG}(x)$ is an outer bound on the conditional marginal polytope
associated with the map $\phi(x, y)$. Moreover, $H_{\mathrm{Gauss}}(\mu | x)$ is an upper bound
on the entropy by using a Gaussian with identical variance. More refined
tree decompositions exist, too. The key benefit of our approach is that the
solution $\mu$ of the optimization problem (3.15) can immediately be used as a
gradient of the upper bound. This is computationally rather efficient.
Likewise note that [TGK04] use relaxations when solving structured estimation
problems of the form

\[ l(x, y, w) = \max_{y'} \Gamma(y, y') \langle w, \phi(x, y') - \phi(x, y) \rangle + \Delta(y, y'), \tag{3.16} \]


by enlarging the domain of maximization with respect to y’. For instance, 
instead of an integer programming problem we might relax the setting to 
a linear program which is much cheaper to solve. This, again, provides an 
upper bound on the original loss function. 

In summary, we have demonstrated that convex relaxation strategies are 
well applicable for bundle methods. In fact, the results of the corresponding 
optimization procedures can be used directly for further optimization steps. 


A3.1.3 Scalar Multivariate Performance Scores 

We now discuss a series of structured loss functions and how they can be 
implemented efficiently. For the sake of completeness, we give a concise rep- 
resentation of previous work on multivariate performance scores and ranking 
methods. All these loss functions rely on having access to (w, x), which can 
be computed efficiently by using the same operations as in Section A3.1.1. 


A3.1.3.1 ROC Score


Denote by $f = Xw$ the vector of function values on the training set. It is
well known that the area under the ROC curve is given by

\[ \mathrm{AUC}(x, y, w) = \frac{1}{m_+ m_-} \sum_{y_i < y_j} I(\langle w, x_i \rangle < \langle w, x_j \rangle), \tag{3.17} \]

where $m_+$ and $m_-$ are the numbers of positive and negative observations
respectively, and $I(\cdot)$ is the indicator function. Directly optimizing the cost
$1 - \mathrm{AUC}(x, y, w)$ is difficult as it is not continuous in $w$. By using
$\max(0, 1 + \langle w, x_i - x_j \rangle)$ as the surrogate loss function for all pairs $(i, j)$ for which $y_i < y_j$
we have the following convex multivariate empirical risk

\[ R_{\mathrm{emp}}(w) = \frac{1}{m_+ m_-} \sum_{y_i < y_j} \max(0, 1 + \langle w, x_i - x_j \rangle) = \frac{1}{m_+ m_-} \sum_{y_i < y_j} \max(0, 1 + f_i - f_j). \tag{3.18} \]

Obviously, we could compute $R_{\mathrm{emp}}(w)$ and its derivative by an $O(m^2)$ operation.
However [Joa05] showed that both can be computed in $O(m \log m)$
time using a sorting operation, which we now describe.

Denote by $c = f - \frac{1}{2} y$ an auxiliary variable and let $i$ and $j$ be indices such
that $y_i = -1$ and $y_j = 1$. It follows that $c_i - c_j = 1 + f_i - f_j$. The efficient
algorithm is due to the observation that at most $m$ distinct terms
$c_k$, $k = 1, \ldots, m$, appear in (3.18), each with its own frequency $l_k$ and sign.
These frequencies $l_k$ can be determined by first sorting $c$ in ascending order
and then scanning through the labels in that order while keeping running
statistics: the number $s_-$ of negative labels yet to be encountered and the
number $s_+$ of positive labels already encountered. When visiting $y_k$, we know
that $c_k$ appears $s_+$ (or $s_-$) times with positive (or negative) sign in (3.18)
if $y_k = -1$ (or $y_k = 1$). Algorithm 3.2 spells out explicitly how to compute
$R_{\mathrm{emp}}(w)$ and its subgradient.

Algorithm 3.2 ROCScore(X, y, w)
1: input: Feature matrix $X$, labels $y$, and weight vector $w$
2: initialization: $s_- = m_-$ and $s_+ = 0$ and $l = 0_m$ and $c = Xw - \frac{1}{2} y$
3: $\pi \leftarrow \{1, \ldots, m\}$ sorted in ascending order of $c$
4: for $i = 1$ to $m$ do
5:   if $y_{\pi_i} = -1$ then
6:     $l_{\pi_i} \leftarrow s_+$ and $s_- \leftarrow s_- - 1$
7:   else
8:     $l_{\pi_i} \leftarrow -s_-$ and $s_+ \leftarrow s_+ + 1$
9:   end if
10: end for
11: Rescale $l \leftarrow l/(m_+ m_-)$ and compute $r = \langle l, c \rangle$ and $g = l^\top X$
12: return Risk $r$ and subgradient $g$
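As a rough illustration of Algorithm 3.2, the following Python sketch evaluates (3.18) with a single sort and checks it against the direct $O(m^2)$ sum over pairs. It works on the precomputed scores $f = Xw$ and uses plain lists; the function names `roc_risk` and `roc_risk_pairs` are our own and not part of the original algorithm.

```python
def roc_risk(f, y):
    """Evaluate (3.18) by a sorted sweep.

    f: scores f = Xw; y: labels in {-1, +1}.
    Returns (r, l) where the subgradient in w is l^T X.
    """
    m = len(f)
    m_pos = sum(1 for label in y if label == 1)
    m_neg = m - m_pos
    c = [fi - 0.5 * yi for fi, yi in zip(f, y)]
    order = sorted(range(m), key=lambda k: c[k])  # ascending in c
    s_neg, s_pos = m_neg, 0
    l = [0.0] * m
    for k in order:
        if y[k] == -1:
            l[k] = s_pos       # positives already seen have smaller c
            s_neg -= 1
        else:
            l[k] = -s_neg      # negatives still to come have larger c
            s_pos += 1
    scale = 1.0 / (m_pos * m_neg)
    l = [scale * v for v in l]
    r = sum(li * ci for li, ci in zip(l, c))
    return r, l


def roc_risk_pairs(f, y):
    """Reference O(m^2) evaluation of (3.18) over all (negative, positive) pairs."""
    neg = [i for i, label in enumerate(y) if label == -1]
    pos = [j for j, label in enumerate(y) if label == 1]
    total = sum(max(0.0, 1.0 + f[i] - f[j]) for i in neg for j in pos)
    return total / (len(pos) * len(neg))
```

Both functions assume at least one positive and one negative label. The coefficient vector `l` sums to zero, since every active pair contributes $+1$ to a negative and $-1$ to a positive observation.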


A3.1.3.2 Ordinal Regression


Essentially the same preference relationships need to hold for ordinal
regression. The only difference is that $y_i$ need not take on binary values any
more. Instead, we may have an arbitrary number of different values $y_i$ (e.g.,
1 corresponding to 'strong reject' up to 10 corresponding to 'strong accept',
when it comes to ranking papers for a conference). That is, we now have
$y_i \in \{1, \ldots, n\}$ rather than $y_i \in \{\pm 1\}$. Our goal is to find some $w$ such that
$\langle w, x_i - x_j \rangle < 0$ whenever $y_i < y_j$. Whenever this relationship is not
satisfied, we incur a cost $C(y_i, y_j)$ for preferring $x_i$ to $x_j$. For example, $C(y_i, y_j)$
could be constant, i.e., $C(y_i, y_j) = 1$ [Joa06], or linear, i.e., $C(y_i, y_j) = y_j - y_i$.

Denote by $m_i$ the number of $x_j$ for which $y_j = i$. In this case, there are
$\bar{M} = m^2 - \sum_{i=1}^{n} m_i^2$ pairs $(y_i, y_j)$ for which $y_i \neq y_j$; this implies that there
are $M = \bar{M}/2$ pairs $(y_i, y_j)$ such that $y_i < y_j$. Normalizing by the total
number of comparisons we may write the overall cost of the estimator as

$$\frac{1}{M} \sum_{y_i < y_j} C(y_i, y_j) I(\langle w, x_i \rangle > \langle w, x_j \rangle) \quad \text{where} \quad M = \frac{1}{2} \left[ m^2 - \sum_{i=1}^{n} m_i^2 \right]. \tag{3.19}$$

Using the same convex majorization as above when we were maximizing the
ROC score, we obtain an empirical risk of the form

$$R_{\mathrm{emp}}(w) = \frac{1}{M} \sum_{y_i < y_j} C(y_i, y_j) \max(0, 1 + \langle w, x_i - x_j \rangle). \tag{3.20}$$


Now the goal is to find an efficient algorithm for obtaining the number of
times the individual losses are nonzero, so as to compute both the
value and the gradient of $R_{\mathrm{emp}}(w)$. The complication arises from the fact
that an observation $x_i$ with label $y_i$ may appear on either side of the inequality,
depending on whether $y_j < y_i$ or $y_j > y_i$. This problem can be solved as
follows: sort $f = Xw$ in ascending order and traverse it while keeping track
of how many items with a lower value $y_j$ are no more than 1 apart in terms
of their value of $f_i$. This way we may compute the count statistics efficiently.
Algorithm 3.3 describes the details, generalizing the results of [Joa06]. Again,
its runtime is $O(m \log m)$, thus allowing for efficient computation.
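The sweep can be sketched in Python as follows, under the assumption that labels are integers in $\{1, \ldots, n\}$ and that the cost is supplied as a function `cost(a, b)` for $a < b$ (both choices are ours). Each score is entered twice, once as $f_j - \frac{1}{2}$ and once as $f_j + \frac{1}{2}$, so that a pair is active exactly when the shifted-down copy of the higher-labelled item is met before the shifted-up copy of the lower-labelled one. A direct sum over pairs serves as a reference.

```python
def ordinal_risk(f, y, n, cost):
    """Sorted-sweep evaluation of (3.20).

    f: scores f = Xw; y: labels in {1, ..., n}; cost(a, b): C(a, b) for a < b.
    Returns (r, g); the subgradient in w is g^T X.
    """
    m = len(f)
    counts = [0] * (n + 1)           # counts[k] = m_k, index 0 unused
    for label in y:
        counts[label] += 1
    M = (m * m - sum(v * v for v in counts)) / 2
    c = [fi - 0.5 for fi in f] + [fi + 0.5 for fi in f]
    order = sorted(range(2 * m), key=lambda p: c[p])
    low = [0] * (n + 1)              # low[k]: class-k items whose f - 1/2 was passed
    up = counts[:]                   # up[k]: class-k items whose f + 1/2 is still ahead
    r, g = 0.0, [0.0] * m
    for p in order:
        j = p % m
        if p < m:                    # visiting f_j - 1/2: j is the higher-labelled item
            for k in range(1, y[j]):
                r -= cost(k, y[j]) * up[k] * c[p]
                g[j] -= cost(k, y[j]) * up[k]
            low[y[j]] += 1
        else:                        # visiting f_j + 1/2: j is the lower-labelled item
            for k in range(y[j] + 1, n + 1):
                r += cost(y[j], k) * low[k] * c[p]
                g[j] += cost(y[j], k) * low[k]
            up[y[j]] -= 1
    return r / M, [v / M for v in g]


def ordinal_risk_pairs(f, y, n, cost):
    """Reference O(m^2) evaluation of (3.20)."""
    m = len(f)
    counts = [0] * (n + 1)
    for label in y:
        counts[label] += 1
    M = (m * m - sum(v * v for v in counts)) / 2
    r, g = 0.0, [0.0] * m
    for i in range(m):
        for j in range(m):
            if y[i] < y[j] and 1.0 + f[i] - f[j] > 0:
                r += cost(y[i], y[j]) * (1.0 + f[i] - f[j])
                g[i] += cost(y[i], y[j])
                g[j] -= cost(y[i], y[j])
    return r / M, [v / M for v in g]
```

With the inner loops over classes the sketch costs $O(m \log m + mn)$, which matches the stated $O(m \log m)$ when the number of classes $n$ is treated as a constant.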


A3.1.3.3 Preference Relations


In general, our loss may be described by means of a set of preference relations
$j \succ i$ for arbitrary pairs $(i, j) \in \{1, \ldots, m\}^2$ associated with a cost $C(i, j)$
which is incurred whenever $i$ is ranked above $j$. This set of preferences may
or may not form a partial or a total order on the domain of all observations.
When it does, efficient computations along the lines of Algorithm 3.3 exist.
In general, this is not the case and we need to rely on the fact that the set
$P$ containing all preferences is sufficiently small that it can be enumerated
efficiently. The risk is then given by


$$\frac{1}{|P|} \sum_{(i,j) \in P} C(i, j) I(\langle w, x_i \rangle > \langle w, x_j \rangle). \tag{3.21}$$

Algorithm 3.3 OrdinalRegression(X, y, w, C)
1: input: Feature matrix $X$, labels $y$, weight vector $w$, and score matrix $C$
2: initialization: $l = 0_n$ and $u_i = m_i \ \forall i \in [n]$ and $r = 0$ and $g = 0_m$
3: Compute $f = Xw$ and set $c = [f - \frac{1}{2}, f + \frac{1}{2}] \in \mathbb{R}^{2m}$ (concatenate the vectors)
4: Compute $M = (m^2 - \sum_{i=1}^{n} m_i^2)/2$
5: Rescale $C \leftarrow C/M$
6: $\pi \leftarrow \{1, \ldots, 2m\}$ sorted in ascending order of $c$
7: for $i = 1$ to $2m$ do
8:   $j = \pi_i \bmod m$
9:   if $\pi_i \leq m$ then
10:     for $k = 1$ to $y_j - 1$ do
11:       $r \leftarrow r - C(k, y_j) u_k c_j$
12:       $g_j \leftarrow g_j - C(k, y_j) u_k$
13:     end for
14:     $l_{y_j} \leftarrow l_{y_j} + 1$
15:   else
16:     for $k = y_j + 1$ to $n$ do
17:       $r \leftarrow r + C(y_j, k) l_k c_{j+m}$
18:       $g_j \leftarrow g_j + C(y_j, k) l_k$
19:     end for
20:     $u_{y_j} \leftarrow u_{y_j} - 1$
21:   end if
22: end for
23: $g \leftarrow g^\top X$
24: return: Risk $r$ and subgradient $g$


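Again, the same majorization argument as before allows us to write a convex
upper bound

$$R_{\mathrm{emp}}(w) = \frac{1}{|P|} \sum_{(i,j) \in P} C(i, j) \max(0, 1 + \langle w, x_i \rangle - \langle w, x_j \rangle) \tag{3.22}$$

where

$$\partial_w R_{\mathrm{emp}}(w) = \frac{1}{|P|} \sum_{(i,j) \in P} C(i, j) \begin{cases} 0 & \text{if } \langle w, x_j - x_i \rangle \geq 1 \\ x_i - x_j & \text{otherwise.} \end{cases} \tag{3.23}$$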


The implementation is straightforward, as given in Algorithm 3.4.

Algorithm 3.4 Preference(X, w, C, P)
1: input: Feature matrix $X$, weight vector $w$, score matrix $C$, and preference set $P$
2: initialization: $r = 0$ and $g = 0_m$
3: Compute $f = Xw$
4: while $(i, j) \in P$ do
5:   if $f_j - f_i < 1$ then
6:     $r \leftarrow r + C(i, j)(1 + f_i - f_j)$
7:     $g_i \leftarrow g_i + C(i, j)$ and $g_j \leftarrow g_j - C(i, j)$
8:   end if
9: end while
10: $g \leftarrow g^\top X$
11: return Risk $r$ and subgradient $g$
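A minimal Python sketch of this computation is given below. It assumes that $P$ is a list of index pairs $(i, j)$ with zero-based indices and that the cost is passed as a function `cost(i, j)`; these conventions, the function name, and the explicit $1/|P|$ scaling taken from (3.22) are our choices for illustration.

```python
def preference_risk(X, w, P, cost):
    """Risk (3.22) and subgradient (3.23) for a list P of pairs (i, j).

    X: list of feature rows; w: weight vector; cost(i, j): C(i, j).
    """
    f = [sum(xi * wi for xi, wi in zip(row, w)) for row in X]  # f = Xw
    r = 0.0
    coeff = [0.0] * len(X)
    for i, j in P:
        margin = 1.0 + f[i] - f[j]
        if margin > 0:               # equivalent to f_j - f_i < 1
            r += cost(i, j) * margin
            coeff[i] += cost(i, j)
            coeff[j] -= cost(i, j)
    scale = 1.0 / len(P)
    # subgradient in w: coeff^T X, rescaled by 1/|P|
    grad = [scale * sum(coeff[i] * X[i][d] for i in range(len(X)))
            for d in range(len(w))]
    return scale * r, grad
```

For instance, with scores $f = (0, 2, 0.5)$ and $P = \{(0,1), (2,1), (0,2)\}$ at unit cost, only the pair $(0, 2)$ violates the margin, so the risk is $0.5/3$.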


A3.1.3.4 Ranking


In webpage and document ranking we are often in a situation similar to that
described in Section A3.1.3.2, however with the difference that we do not
only care about objects $x_i$ being ranked according to scores $y_i$ but moreover
that different degrees of importance are placed on different documents.

The information retrieval literature is full of different scoring functions.
Examples are criteria such as Normalized Discounted
Cumulative Gain (NDCG), Mean Reciprocal Rank (MRR), Precision@n, or
Expected Rank Utility (ERU). They are used to address the issue of evaluating
rankers, search engines or recommender systems [Voo01, JIK02, BHK98,
BHO04]. For instance, in webpage ranking only the first $k$ retrieved documents
matter, since users are unlikely to look beyond the first $k$, say
10, retrieved webpages in an internet search. [LS07] show that these scores
can be optimized directly by minimizing the following loss:

$$l(X, y, w) = \max_{\pi} \sum_i c_i \langle w, x_{\pi(i)} - x_i \rangle + \langle a - a(\pi), b(y) \rangle. \tag{3.24}$$

Here $c_i$ is a monotonically decreasing sequence, the documents are assumed
to be arranged in order of decreasing relevance, $\pi$ is a permutation, the
vectors $a$ and $b(y)$ depend on the choice of a particular ranking measure, and
$a(\pi)$ denotes the permutation of $a$ according to $\pi$. Pre-computing $f = Xw$
we may rewrite (3.24) as


$$l(f, y) = \max_{\pi} \left[ c^\top f(\pi) - a(\pi)^\top b(y) \right] - c^\top f + a^\top b(y) \tag{3.25}$$

Algorithm 3.5 Ranking(X, y, w)
1: input: Feature matrix $X$, relevances $y$, and weight vector $w$
2: Compute vectors $a$ and $b(y)$ according to some ranking measure
3: Compute $f = Xw$
4: Compute elements of matrix $C_{ij} = c_i f_j - b_i a_j$
5: $\pi = \mathrm{LinearAssignment}(C)$
6: $r = c^\top (f(\pi) - f) + (a - a(\pi))^\top b$
7: $g = c(\pi^{-1}) - c$ and $g \leftarrow g^\top X$
8: return Risk $r$ and subgradient $g$


and consequently the derivative of $l(X, y, w)$ with respect to $w$ is given by

$$\partial_w l(X, y, w) = (c(\bar{\pi}^{-1}) - c)^\top X \quad \text{where} \quad \bar{\pi} = \operatorname{argmax}_{\pi} \, c^\top f(\pi) - a(\pi)^\top b(y). \tag{3.26}$$

Here $\pi^{-1}$ denotes the inverse permutation, such that $\pi \circ \pi^{-1} = 1$. Finding the
permutation maximizing $c^\top f(\pi) - a(\pi)^\top b(y)$ is a linear assignment problem
which can be easily solved by the Hungarian Marriage algorithm, that is,
the Kuhn-Munkres algorithm.

The original papers by [Kuh55] and [Mun57] implied an algorithm with
$O(m^3)$ cost in the number of terms. Later, [Kar80] suggested an algorithm
with expected quadratic time in the size of the assignment problem (ignoring
log-factors). Finally, [OL93] propose a linear time algorithm for large
problems. Since in our case the number of pages is fairly small (in the order
of 50 to 200 per query) the scaling behavior per query is not too important.
We used an existing implementation due to [JV87].

Note also that training sets consist of a collection of ranking problems, 
that is, we have several ranking problems of size 50 to 200. By means of 
parallelization we are able to distribute the work onto a cluster of worksta- 
tions, which is able to overcome the issue of the rather costly computation 
per collection of queries. Algorithm 3.5 spells out the steps in detail. 
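To make the roles of $\pi$ and $\pi^{-1}$ concrete, the following Python sketch evaluates (3.25) and the coefficients $c(\bar{\pi}^{-1}) - c$ from (3.26) for a very small problem. It replaces the linear assignment solver by exhaustive enumeration of all permutations, which is only feasible for a handful of documents, and it takes the vectors `a`, `b` and `c` as given inputs; both simplifications are ours and stand in for the solver and the ranking measure used in the text.

```python
from itertools import permutations


def ranking_loss(f, a, b, c):
    """Brute-force evaluation of (3.25) for small m.

    Returns (loss, pi, coeff), where pi maximizes
    sum_i c[i] * f[pi[i]] - a[pi[i]] * b[i]
    and coeff = c(pi^{-1}) - c, so that the subgradient in w is coeff^T X.
    """
    m = len(f)

    def objective(pi):
        return sum(c[i] * f[pi[i]] - a[pi[i]] * b[i] for i in range(m))

    best = max(permutations(range(m)), key=objective)
    identity = tuple(range(m))
    # objective(identity) equals c^T f - a^T b, the terms subtracted in (3.25)
    loss = objective(best) - objective(identity)
    inverse = [0] * m
    for position, item in enumerate(best):
        inverse[item] = position
    coeff = [c[inverse[j]] - c[j] for j in range(m)]
    return loss, best, coeff
```

Since the identity permutation is always a candidate, the loss is nonnegative, and the coefficient vector sums to zero because it is the difference between a permutation of $c$ and $c$ itself.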


A3.1.3.5 Contingency Table Scores


[Joa05] observed that $F_\beta$ scores and related quantities dependent on a
contingency table can also be computed efficiently by means of structured
estimation. Such scores depend in general on the number of true and false
positives and negatives alike. Algorithm 3.6 shows how a corresponding
empirical risk and subgradient can be computed efficiently. As with the
previous losses, here again we use convex majorization to obtain a tractable
optimization problem.

Given a set of labels $y$ and an estimate $y'$, the numbers of true positives
($T_+$), true negatives ($T_-$), false positives ($F_+$), and false negatives ($F_-$) are
determined according to a contingency table as follows:

              $y > 0$    $y < 0$
  $y' > 0$    $T_+$      $F_+$
  $y' < 0$    $F_-$      $T_-$

In the sequel, we denote by $m_+ = T_+ + F_-$ and $m_- = T_- + F_+$ the numbers
of positive and negative labels in $y$, respectively. We note that the $F_\beta$ score can
be computed based on the contingency table [Joa05] as

$$F_\beta(T_+, T_-) = \frac{(1 + \beta^2) T_+}{T_+ + m_- - T_- + \beta^2 m_+}. \tag{3.27}$$


If we want to use $\langle w, x_i \rangle$ to estimate the label of observation $x_i$, we may use
the following structured loss to "directly" optimize w.r.t. the $F_\beta$ score [Joa05]:

$$l(X, y, w) = \max_{y'} \left[ (y' - y)^\top f + \Delta(T_+, T_-) \right], \tag{3.28}$$

where $f = Xw$, $\Delta(T_+, T_-) := 1 - F_\beta(T_+, T_-)$, and $(T_+, T_-)$ is determined
by using $y$ and $y'$. Since $\Delta$ does not depend on the specific choice of $(y, y')$
but rather just on the sets on which they disagree, $l$ can be maximized by
enumerating all possible $m_+ m_-$ contingency tables such that, for a given
configuration $(T_+, T_-)$, the $T_+$ ($T_-$) positive (negative) observations $x_i$ with
the largest (lowest) values of $\langle w, x_i \rangle$ are labeled as positive (negative). This is
implemented as a nested loop and hence runs in $O(m^2)$ time. Algorithm
3.6 describes the procedure in detail.
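The enumeration can be sketched in Python as follows. The sketch works on the scores $f = Xw$ and indexes the partial sums so that entry $i$ covers the positives ranked below the top $i$ (and likewise for negatives), which is our reading of (3.28). An exhaustive search over all labelings $y'$ is included as a reference for small $m$. All function names are our own.

```python
from itertools import product


def fbeta(tp, tn, m_pos, m_neg, beta):
    """F_beta score (3.27) from true positives and true negatives."""
    return (1 + beta ** 2) * tp / (tp + m_neg - tn + beta ** 2 * m_pos)


def fbeta_loss(f, y, beta=1.0):
    """Maximize (3.28) over contingency tables; y has labels in {-1, +1}."""
    pos = sorted((i for i, v in enumerate(y) if v == 1), key=lambda i: -f[i])
    neg = sorted((i for i, v in enumerate(y) if v == -1), key=lambda i: f[i])
    m_pos, m_neg = len(pos), len(neg)
    p = [0.0] * (m_pos + 1)          # p[i] = 2 * sum of f over positives beyond the top i
    for i in range(m_pos - 1, -1, -1):
        p[i] = p[i + 1] + 2 * f[pos[i]]
    q = [0.0] * (m_neg + 1)          # q[j] = 2 * sum of f over negatives beyond the lowest j
    for j in range(m_neg - 1, -1, -1):
        q[j] = q[j + 1] + 2 * f[neg[j]]
    best = (float("-inf"), 0, 0)
    for i in range(m_pos + 1):
        for j in range(m_neg + 1):
            value = 1 - fbeta(i, j, m_pos, m_neg, beta) - p[i] + q[j]
            if value > best[0]:
                best = (value, i, j)
    r, t_pos, t_neg = best
    y_hat = [-v for v in y]          # start with every label flipped
    for i in pos[:t_pos]:
        y_hat[i] = 1
    for i in neg[:t_neg]:
        y_hat[i] = -1
    coeff = [a - b for a, b in zip(y_hat, y)]   # subgradient in w is coeff^T X
    return r, coeff


def fbeta_loss_exhaustive(f, y, beta=1.0):
    """Reference: maximize (3.28) over all 2^m labelings."""
    m_pos = sum(1 for v in y if v == 1)
    m_neg = len(y) - m_pos
    best = float("-inf")
    for y_hat in product((-1, 1), repeat=len(y)):
        tp = sum(1 for a, b in zip(y_hat, y) if a == 1 and b == 1)
        tn = sum(1 for a, b in zip(y_hat, y) if a == -1 and b == -1)
        gain = sum((a - b) * fi for a, b, fi in zip(y_hat, y, f))
        best = max(best, gain + 1 - fbeta(tp, tn, m_pos, m_neg, beta))
    return best
```

The sketch assumes at least one positive label, so that the denominator in (3.27) is positive. The loss is nonnegative because the choice $y' = y$ gives $\Delta = 0$.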


A3.1.4 Vector Loss Functions 


Next we discuss "vector" loss functions, i.e., functions where $w$ is best
described as a matrix (denoted by $W$) and the loss depends on $Wx$. Here, we
have feature vector $x \in \mathbb{R}^d$, label $y \in \mathbb{R}^k$, and weight matrix $W \in \mathbb{R}^{d \times k}$. We
also denote feature matrix $X \in \mathbb{R}^{m \times d}$ as a matrix of $m$ feature vectors $x_i$,
and stack up the columns $W_i$ of $W$ as a vector $w$.

Some of the most relevant cases are multiclass classification using both
the exponential families model and structured estimation, hierarchical models,
i.e., ontologies, and multivariate regression. Many of those cases are
summarized in Table A3.1.

Algorithm 3.6 $F_\beta$(X, y, w)
1: input: Feature matrix $X$, labels $y$, and weight vector $w$
2: Compute $f = Xw$
3: $\pi^+ \leftarrow \{i : y_i = 1\}$ sorted in descending order of $f$
4: $\pi^- \leftarrow \{i : y_i = -1\}$ sorted in ascending order of $f$
5: Let $p_i = 2 \sum_{k=i+1}^{m_+} f_{\pi^+_k}$, $i = 0, \ldots, m_+$ (so that $p_{m_+} = 0$)
6: Let $n_i = 2 \sum_{k=i+1}^{m_-} f_{\pi^-_k}$, $i = 0, \ldots, m_-$ (so that $n_{m_-} = 0$)
7: $y' \leftarrow -y$ and $r \leftarrow -\infty$
8: for $i = 0$ to $m_+$ do
9:   for $j = 0$ to $m_-$ do
10:     $r_{\mathrm{tmp}} = \Delta(i, j) - p_i + n_j$
11:     if $r_{\mathrm{tmp}} > r$ then
12:       $r \leftarrow r_{\mathrm{tmp}}$
13:       $T_+ \leftarrow i$ and $T_- \leftarrow j$
14:     end if
15:   end for
16: end for
17: $y'_{\pi^+_i} \leftarrow 1$, $i = 1, \ldots, T_+$
18: $y'_{\pi^-_i} \leftarrow -1$, $i = 1, \ldots, T_-$
19: $g \leftarrow (y' - y)^\top X$
20: return Risk $r$ and subgradient $g$


A3.1.4.1 Unstructured Setting


The simplest loss is multivariate regression, where
$l(x, y, W) = \frac{1}{2} (y - x^\top W)^\top M (y - x^\top W)$. In this case it is clear that by
pre-computing $XW$ subsequent calculations of the loss and its gradient are
significantly accelerated.

A second class of important losses is given by plain multiclass classification
problems, e.g., recognizing digits of a postal code or categorizing high-level
document categories. In this case, $\phi(x, y)$ is best represented by $e_y \otimes x$ (using
a linear model). Clearly we may view $\langle w, \phi(x, y) \rangle$ as an operation which
chooses a column indexed by $y$ from $xW$, since all labels $y$ correspond to
a different weight vector $W_y$. Formally we set $\langle w, \phi(x, y) \rangle = [xW]_y$. In this
case, structured estimation losses can be rewritten as

$$l(x, y, W) = \max_{y'} \Gamma(y, y') \langle W_{y'} - W_y, x \rangle + \Delta(y, y') \tag{3.29}$$

and

$$\partial_W l(x, y, W) = \Gamma(y, y^*) (e_{y^*} - e_y) \otimes x. \tag{3.30}$$


Here $\Gamma$ and $\Delta$ are defined as in Section A3.1.2 and $y^*$ denotes the value of $y'$
for which the RHS of (3.29) is maximized. This means that for unstructured
multiclass settings we may simply compute $xW$. Since this needs to be
performed for all observations $x_i$ we may take advantage of fast linear algebra
routines and compute $f = XW$ for efficiency. Likewise note that computing
the gradient over $m$ observations is now a matrix-matrix multiplication,
too: denote by $G$ the matrix of rows of gradients $\Gamma(y_i, y_i^*)(e_{y_i^*} - e_{y_i})$. Then
$\partial_W R_{\mathrm{emp}}(X, y, W) = G^\top X$. Note that $G$ is very sparse with at most two
nonzero entries per row, which makes the computation of $G^\top X$ essentially
as expensive as two matrix vector multiplications. Whenever we have many
classes, this may yield significant computational gains.
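The following Python sketch illustrates (3.29) and (3.30) over a batch. It assumes $\Gamma(y, y') = 1$ and the 0/1 cost $\Delta(y, y') = 1$ for $y' \neq y$, labels numbered from zero, and an average over the $m$ observations; these are simplifying assumptions of ours, since $\Gamma$ and $\Delta$ are defined elsewhere in the text. Plain nested lists stand in for the fast linear algebra routines mentioned above.

```python
def multiclass_margin_risk(X, y, W):
    """Soft-margin multiclass risk and G^T X (a k x d matrix), averaged over m.

    X: m x d rows; y: labels in {0, ..., k-1}; W: d x k weights.
    Assumes Gamma = 1 and the 0/1 cost Delta.
    """
    m, d, k = len(X), len(X[0]), len(W[0])
    # F = XW, computed once for all observations
    F = [[sum(X[i][j] * W[j][c] for j in range(d)) for c in range(k)]
         for i in range(m)]
    risk = 0.0
    G = []
    for scores, label in zip(F, y):
        best, best_val = label, 0.0
        for c in range(k):
            val = scores[c] - scores[label] + (0.0 if c == label else 1.0)
            if val > best_val:
                best, best_val = c, val
        row = [0.0] * k              # at most two nonzero entries per row
        if best != label:
            row[best] += 1.0
            row[label] -= 1.0
        G.append(row)
        risk += best_val
    grad = [[sum(G[i][c] * X[i][j] for i in range(m)) / m for j in range(d)]
            for c in range(k)]
    return risk / m, grad
```

Row `c` of the returned gradient holds the derivative with respect to the column $W_c$, so the result has the transposed shape of $W$, in line with the expression $G^\top X$.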

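Log-likelihood scores of exponential families share similar expansions. We
have

$$l(x, y, W) = \log \sum_{y'} \exp \langle w, \phi(x, y') \rangle - \langle w, \phi(x, y) \rangle = \log \sum_{y'} \exp \langle W_{y'}, x \rangle - \langle W_y, x \rangle \tag{3.31}$$

$$\partial_W l(x, y, W) = \frac{\sum_{y'} (e_{y'} \otimes x) \exp \langle W_{y'}, x \rangle}{\sum_{y'} \exp \langle W_{y'}, x \rangle} - e_y \otimes x. \tag{3.32}$$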


The main difference to the soft-margin setting is that the gradients are
not sparse in the number of classes. This means that the computation of
gradients is slightly more costly.
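As a small illustration of (3.31) and (3.32), the Python sketch below computes the loss and the dense vector of class coefficients for one observation from its row of scores $xW$. Subtracting the largest score before exponentiating is a standard numerical safeguard that we add here; it does not change the value of the loss. The function name and zero-based labels are our own conventions.

```python
import math


def softmax_loss(scores, y):
    """Log-likelihood loss (3.31) for one observation.

    scores: the row xW, one entry per class; y: index of the true class.
    Returns (loss, coeff); the gradient (3.32) is the outer product of coeff and x.
    """
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # shifted for numerical stability
    z = sum(exps)
    loss = top + math.log(z) - scores[y]
    coeff = [e / z for e in exps]                # p(y' | x) for every class
    coeff[y] -= 1.0
    return loss, coeff
```

Every class receives a nonzero coefficient, which is the lack of sparsity noted above. With all scores equal and $k$ classes the loss is $\log k$.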


A3.1.4.2 Ontologies


Fig. A3.1. Two ontologies. Left: a binary hierarchy with internal nodes {1,...,7} 
and labels {8,...15}. Right: a generic directed acyclic graph with internal nodes 
{1,...,6,12} and labels {7,...,11,13,...,15}. Note that node 5 has two parents, 
namely nodes 2 and 3. Moreover, the labels need not be found at the same level of 
the tree: nodes 14 and 15 are one level lower than the rest of the nodes. 


Assume that the labels we want to estimate can be found to belong to
a directed acyclic graph. For instance, this may be a gene-ontology graph
[ABB+00], a patent hierarchy [CH04], or a genealogy. In these cases we have a
hierarchy of categories to which an element $x$ may belong. Figure A3.1 gives
two examples of such directed acyclic graphs (DAG). The first example is
a binary tree, while the second contains nodes with different numbers of
children (e.g., node 4 and 12), nodes at different levels having children (e.g.,
nodes 5 and 12), and nodes which have more than one parent (e.g., node 5).
It is a well-known property of trees in which every internal node has at least
two children that they have at most as many internal nodes as leaf nodes.

It is now our goal to build a classifier which is able to categorize
observations according to which leaf node they belong to (each leaf node is assigned
a label $y$). Denote by $k + 1$ the number of nodes in the DAG including the
root node. In this case we may design a feature map $\phi(y) \in \mathbb{R}^k$ [CH04] by
associating with every label $y$ the vector describing the path from the root
node to $y$, ignoring the root node itself. For instance, for the first DAG in
Figure A3.1 we have

$\phi(8) = (1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)$ and $\phi(13) = (0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0)$.


Whenever several paths are admissible, as in the right DAG of Figure A3.1,
we average over all possible paths. For example, we have

$\phi(10) = (0.5, 0.5, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)$ and $\phi(15) = (0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1)$.


Also note that the lengths of the paths need not be the same (e.g., to
reach 15 it takes a longer path than to reach 13). Likewise, it is natural to
assume that $\Delta(y, y')$, i.e., the cost for mislabeling $y$ as $y'$, will depend on the
similarity of the paths. In other words, it is likely that the cost for placing
$x$ into the wrong sub-sub-category is less than getting the main category of
the object wrong.

To complete the setting, note that for $\phi(x, y) = \phi(y) \otimes x$ the cost of
computing all labels is $k$ inner products, since the value of $\langle w, \phi(x, y) \rangle$ for a
particular $y$ can be obtained by the sum of the contributions for the segments
of the path. This means that the values for all terms can be computed by
a simple breadth first traversal through the graph. As before, we may make
use of vectorization in our approach, since we may compute $xW \in \mathbb{R}^k$ to
obtain the contributions on all segments of the DAG before performing the
graph traversal. Since we have $m$ patterns $x_i$ we may vectorize matters by
pre-computing $XW$.
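For the binary hierarchy, the path feature map and the accumulation of scores along paths can be sketched in Python as follows. The sketch assumes that the left tree of Figure A3.1 is numbered like a binary heap, so that node $n$ has parent $\lfloor n/2 \rfloor$ and node 1 is the root; this numbering is our assumption, chosen because it reproduces the example vectors $\phi(8)$ and $\phi(13)$ given earlier. The function names are our own.

```python
def path_feature(label, num_nodes=15):
    """phi(y) for a heap-numbered binary tree with root 1 (assumed layout).

    Entry n - 2 is 1 if node n lies on the path from the root to the label.
    """
    phi = [0] * (num_nodes - 1)      # the root itself is ignored
    node = label
    while node > 1:
        phi[node - 2] = 1
        node //= 2                   # move to the parent
    return phi


def path_scores(xW, num_nodes=15):
    """Accumulate the contributions xW along every root-to-node path.

    xW[n - 2] is the contribution of node n. Visiting nodes in increasing
    order is a top-down traversal here, since parents precede children.
    Returns z with z[n] = <phi(n), xW>; index 0 is unused.
    """
    z = [0.0] * (num_nodes + 1)
    for node in range(2, num_nodes + 1):
        z[node] = z[node // 2] + xW[node - 2]
    return z
```

A single pass over the $k$ non-root nodes yields the scores of all labels, in contrast to forming one inner product with $\phi(y)$ per label.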

Also note that $\phi(y) - \phi(y')$ is nonzero only for those edges where the paths
for $y$ and $y'$ differ. Hence we only change weights on those parts of the graph
where the categorization differs. Algorithm 3.7 describes the subgradient and
loss computation for the soft-margin type of loss function.

Algorithm 3.7 Ontology(X, y, W)
1: input: Feature matrix $X \in \mathbb{R}^{m \times d}$, labels $y$, and weight matrix $W \in \mathbb{R}^{d \times k}$
2: initialization: $G = 0 \in \mathbb{R}^{m \times k}$ and $r = 0$
3: Compute $f = XW$ and let $f_i = x_i W$
4: for $i = 1$ to $m$ do
5:   Let $D_i$ be the DAG with edges annotated with the values of $f_i$
6:   Traverse $D_i$ to find the node $y^*$ that maximizes the sum of $f_i$ values on the path plus $\Delta(y_i, y')$
7:   $G_i = \phi(y^*) - \phi(y_i)$
8:   $r \leftarrow r + z_{y^*} - z_{y_i}$
9: end for
10: $g = G^\top X$
11: return Risk $r$ and subgradient $g$


The same reasoning applies to estimation when using an exponential
families model. The only difference is that we need to compute a soft-max
over paths rather than exclusively choosing the best path over the ontology.
Again, a breadth-first recursion suffices: each of the leaves $y$ of the
DAG is associated with a probability $p(y|x)$. To obtain $\mathbf{E}_{y \sim p(y|x)}[\phi(y)]$ all
we need to do is perform a bottom-up traversal of the DAG summing over
all probability weights on the path. Wherever a node has more than one
parent, we distribute the probability weight equally over its parents.